
THE UNITED STATES PATENT AND TRADEMARK OFFICE 
Application of: Yu and Turner, Jr. 

Serial No.: 10/044,807
Group Art Unit: 1652

Filed: 1/11/2002
Examiner: S. Swope

For: Human Protease Polynucleotides and Compositions Comprising the Same (As Amended)
Attorney Docket No.: LEX-0298-USA



APPEAL BRIEF 






Mail Stop Appeal Brief - Patents 
Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 



TABLE OF CONTENTS 



I. REAL PARTY IN INTEREST

II. RELATED APPEALS AND INTERFERENCES

III. STATUS OF THE CLAIMS

IV. STATUS OF THE AMENDMENTS

V. SUMMARY OF THE INVENTION

VI. ISSUES ON APPEAL

VII. GROUPING OF THE CLAIMS

VIII. ARGUMENT

A. Do Claims 1-4 Lack a Patentable Utility?

B. Are Claims 1-4 Unusable Due to a Lack of Patentable Utility?

IX. APPENDIX

X. CONCLUSION



APPEAL BRIEF 



Sir: 

Appellants hereby submit an original and two copies of this Appeal Brief to the Board of Patent 
Appeals and Interferences ("the Board") in response to the Final Office Action mailed on 
February 5, 2003. The Notice of Appeal was timely submitted on July 7, 2003, and was received in the 
Patent and Trademark Office ("the Office") on July 14, 2003. This Appeal Brief is timely submitted in light 
of the concurrently filed Petition for an Extension of Time of one month to and including October 14, 2003, 
and authorization to deduct the fee as required under 37 C.F.R. § 1.17(a)(1) from Appellants' 
Representatives' deposit account. The Commissioner is also authorized to charge the fee for filing this 
Appeal Brief ($160.00), as required under 37 C.F.R. § 1.17(c), to Lexicon Genetics Incorporated Deposit 
Account No. 50-0892. 

Appellants believe no fees in addition to the fee for filing the Appeal Brief and the fee for the 
extension of time are due in connection with this Appeal Brief. However, should any additional fees under 
37 C.F.R. §§ 1.16 to 1.21 be required for any reason related to this communication, the Commissioner 
is authorized to charge any underpayment or credit any overpayment to Lexicon Genetics Incorporated 
Deposit Account No. 50-0892. 

I. REAL PARTY IN INTEREST 

The real party in interest is the Assignee, Lexicon Genetics Incorporated, 8800 Technology Forest 
Place, The Woodlands, Texas, 77381. 

II. RELATED APPEALS AND INTERFERENCES 

Appellants know of no related appeals or interferences that will directly affect or be directly 
affected by or have a bearing on the Board's decision in the pending appeal. 

III. STATUS OF THE CLAIMS 

The present application was filed on January 11, 2002, claiming the benefit of U.S. Provisional 
Application Number 60/261,684, which was filed on January 12, 2001, and included original claims 1-3. 
A First Official Action on the merits ("the First Action") was issued on August 12, 2002, in which 
claims 1-3 were rejected under 35 U.S.C. § 101 as allegedly lacking a patentable utility, and claims 1-3 
were rejected under 35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to 
the alleged lack of patentable utility. In a response to the First Official Action submitted to the Office on 
November 12, 2002 ("Response to the First Action"), Appellants added new claim 4, and addressed the 
rejections of claims 1-3. 

A Second and Final Official Action ("the Final Action") was mailed on February 5, 2003, 
maintaining the rejection of claims 1-3 (and newly added claim 4) under 35 U.S.C. § 101 as allegedly 
lacking a patentable utility, and under 35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled 
artisan due to the alleged lack of patentable utility. In a response to the Second and Final Office Action 
submitted on July 7, 2003 ("Response to the Final Action"), Appellants again addressed the rejections of 
claims 1-4. An Advisory Action ("the Advisory Action") was mailed on July 30, 2003, maintaining the 
rejection of claims 1-4 under 35 U.S.C. § 101 as allegedly lacking a patentable utility, and under 
35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the alleged lack of 
patentable utility. Therefore, claims 1-4 are the subject of this appeal. A copy of the appealed claims is 
included below in the Appendix (Section IX). 

IV. STATUS OF THE AMENDMENTS 

As no amendments subsequent to the Final Action have been filed, Appellants believe that no 
outstanding amendments exist. 

V. SUMMARY OF THE INVENTION 

The present invention relates to Appellants' discovery and identification of novel human 
polynucleotide sequences that encode novel proteins sharing sequence similarity with animal proteases, and 
particular structural similarity to the ADAMTS family of metalloproteases (see, at least, the specification 
at page 2, lines 5-7, and page 17, lines 29-32). 

The presently claimed polynucleotide sequences were compiled from cDNAs prepared and 
isolated from human lymph node, kidney, and prostate mRNAs (specification at page 4, lines 7-8). A 
number of coding single nucleotide polymorphisms were identified in the claimed sequence - specifically: 
a C/G polymorphism at position 2361 of SEQ ID NO:1, which can result in an aspartate or glutamate at 
amino acid position 787 of SEQ ID NO:2; a C/A polymorphism at position 2467 of SEQ ID NO:1, which 
can result in a leucine or isoleucine at amino acid position 823 of SEQ ID NO:2; a C/A polymorphism at 
position 2613 of SEQ ID NO:1, both of which result in an isoleucine at corresponding aa position 871 of 
SEQ ID NO:2; a C/T polymorphism at position 3141 of SEQ ID NO:1, both of which result in a serine 
at amino acid position 1047 of SEQ ID NO:2; a G/T polymorphism at position 3225 of SEQ ID NO:1, 
which can result in a glutamine or histidine at amino acid position 1075 of SEQ ID NO:2; a C/T 
polymorphism at position 3226 of SEQ ID NO:1, which can result in an arginine or tryptophan at amino 
acid position 1076 of SEQ ID NO:2; and an A/G polymorphism at position 4226 of SEQ ID NO:1, which 
can result in an aspartate or glycine at amino acid position 1409 of SEQ ID NO:2. 
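
The correspondence between the nucleotide positions and amino acid positions recited above can be checked with simple codon arithmetic. The following is only an illustrative sketch, not part of the specification: it assumes that the open reading frame begins at nucleotide 1 of SEQ ID NO:1, and the function name codon_location and the table POLYMORPHISMS are hypothetical names introduced for the example.

```python
# Illustrative sketch only. ASSUMPTION: the coding sequence starts at
# nucleotide 1 of SEQ ID NO:1, so codon N spans nucleotides 3N-2 to 3N.
# The names below are hypothetical and do not come from the specification.

def codon_location(nt_pos: int) -> tuple[int, int]:
    """Return (codon number, position within codon), both 1-based."""
    if nt_pos < 1:
        raise ValueError("nucleotide positions are 1-based")
    return (nt_pos - 1) // 3 + 1, (nt_pos - 1) % 3 + 1

# Nucleotide position in SEQ ID NO:1 -> amino acid position in SEQ ID NO:2,
# as recited in the text above.
POLYMORPHISMS = {
    2361: 787,
    2467: 823,
    2613: 871,
    3141: 1047,
    3225: 1075,
    3226: 1076,
    4226: 1409,
}

for nt_pos, recited_aa in POLYMORPHISMS.items():
    codon, offset = codon_location(nt_pos)
    status = "consistent" if codon == recited_aa else "MISMATCH"
    print(f"nt {nt_pos}: codon {codon}, codon position {offset} ({status})")
```

Under that assumption, each recited nucleotide position falls within the codon for the recited amino acid position.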

The specification details a number of uses for the presently claimed polynucleotide sequences, 
including in forensic analysis (see, for example, the specification at page 3, line 15, and from page 11, 
line 31 to page 12, line 27), in the identification of protein coding sequence (see, for example, the 
specification at page 3, lines 5-7), in the identification of exon splice junctions (see, for example, the 
specification at page 3, lines 10-11), in mapping the sequences to a specific region of a human chromosome 
(see, for example, the specification at page 3, lines 7-10), and in assessing gene expression patterns, 
particularly using a high throughput "chip" format (see, for example, the specification at page 6, 
lines 16-18). 

VI. ISSUES ON APPEAL 

1. Do claims 1-4 lack a patentable utility? 

2. Are claims 1-4 unusable by a skilled artisan due to a lack of patentable utility? 

VII. GROUPING OF THE CLAIMS 

For the purposes of the outstanding rejections under 35 U.S.C. § 101 and 35 U.S.C. § 112, first 
paragraph, the claims will stand or fall together. 

VIII. ARGUMENT 

A. Do Claims 1-4 Lack a Patentable Utility? 

The Final Action first rejects claims 1-4 under 35 U.S.C. § 101, as allegedly lacking a patentable 
utility due to not being supported by either a specific and substantial or a well-established utility. 

Appellants pointed out both in the Response to the First Action and the Response to the Final 
Action that the present nucleic acid sequences have utility in forensic analysis, as described in the 
specification as originally filed (see, for example, the specification as originally filed, at least at page 3, 
line 15, and from page 11, line 31 to page 12, line 27). As described in the specification at page 18, 
lines 3-27, the present sequences define a number of coding single nucleotide polymorphisms - specifically: 
a C/G polymorphism at position 2361 of SEQ ID NO:1, which can result in an aspartate or glutamate at 
amino acid position 787 of SEQ ID NO:2; a C/A polymorphism at position 2467 of SEQ ID NO:1, which 
can result in a leucine or isoleucine at amino acid position 823 of SEQ ID NO:2; a C/A polymorphism at 
position 2613 of SEQ ID NO:1, both of which result in an isoleucine at corresponding aa position 871 of 
SEQ ID NO:2; a C/T polymorphism at position 3141 of SEQ ID NO:1, both of which result in a serine 
at amino acid position 1047 of SEQ ID NO:2; a G/T polymorphism at position 3225 of SEQ ID NO:1, 
which can result in a glutamine or histidine at amino acid position 1075 of SEQ ID NO:2; a C/T 
polymorphism at position 3226 of SEQ ID NO:1, which can result in an arginine or tryptophan at amino 
acid position 1076 of SEQ ID NO:2; and an A/G polymorphism at position 4226 of SEQ ID NO:1, which 
can result in an aspartate or glycine at amino acid position 1409 of SEQ ID NO:2. As such polymorphisms 
are the basis for forensic analysis, which is undoubtedly a "real world" utility, the presently claimed 
sequence must in itself be useful. 

Appellants respectfully point out that the presently described polymorphisms are useful in forensic 
analysis exactly as they were described in the specification as originally filed - specifically, to distinguish 
individual members of the human population from one another based simply on the presence or absence 
of one or more of the described polymorphisms. The skilled artisan would be able to use the presently 
described polymorphisms in forensic analysis exactly as they were described in the specification as 
originally filed, without any additional research. It is important to note that simply because the use of these 
polymorphic markers will necessarily provide additional information on the percentage of particular 
subpopulations that contain these polymorphic markers does not mean that additional research is needed 
in order for these markers as they are presently described in the instant specification to be used in forensic 
science. 

This is also not a case of a potential utility. Even in the worst case scenario, the described 
polymorphisms are each useful to distinguish 50% of the population (in other words, the marker being 
present in half of the population). Appellants point out that the ability of a polymorphic marker to 
distinguish at least 50% of the population is an inherent feature of any polymorphic marker, and this feature 
is well understood by those of skill in the art. Appellants note that as a matter of law, it is well settled that 
a patent need not disclose what is well known in the art. In re Wands, 8 USPQ2d 1400 (Fed. Cir. 1988). 
Appellants respectfully point out that all that is required to support Appellants' assertion of utility is for the 
skilled artisan to believe that the presently described polymorphic markers could be useful in forensic 
analysis. The fact that forensic biologists use polymorphic markers such as those described by Appellants 
every day provides more than ample support for the assertion that forensic biologists would also be able 
to use the specific polymorphic markers described by Appellants in the same fashion. Therefore, the 
presently claimed sequence clearly has a substantial and well established utility. 

The Examiner questioned this asserted utility, stating "the presence of polymorphisms in human 
DNA is well established and virtually any locus on a human chromosome will exhibit one or more 
polymorphisms which could be so used" (the Final Action at page 2). This argument is flawed in a number 
of respects. First, until a polymorphic marker is actually described it cannot be used in forensic analysis. 
Put another way, simply because there is a likelihood, even a significant likelihood, that a particular nucleic 
acid sequence will contain a polymorphism and thus be useful in forensic analysis, until such a polymorphism 
is actually identified and described, such a likelihood is meaningless. The Examiner appears to be 
attempting to use the information presented for the first time by Appellants in the instant specification as 
hindsight verification that the presently claimed sequence would be expected to have polymorphic markers. 
Such hindsight analysis based on Appellants' discovery is completely improper. Second, the Examiner 
seems to be confusing the requirements of a specific utility with a unique utility. The fact that other 
polymorphic markers have been identified in other genetic loci, or that the use of the presently described 
polymorphic markers will provide additional information concerning the prevalence of these markers in 
certain subpopulations, does not mean that use of the polymorphic markers identified by Appellants in 
forensic analysis is not a specific utility. As clearly stated by the Federal Circuit in Carl Zeiss Stiftung v. 
Renishaw PLC, 20 USPQ2d 1101 (Fed. Cir. 1991): 

An invention need not be the best or only way to accomplish a certain result, and it need 
only be useful to some extent and in certain applications: "[T]he fact that an invention has 
only limited utility and is only operable in certain applications is not grounds for finding a 
lack of utility." Envirotech Corp. v. Al George, Inc., 221 USPQ 473, 480 (Fed. Cir. 
1984) 

In other words, just because other (possibly better) polymorphic markers from the human genome have 
been described, or that additional information about the presently described polymorphic markers can be 
gained through the use of these markers, does not establish that the presently described polymorphic 
markers lack a specific utility. If every invention were required to have a unique utility, the Patent and 
Trademark Office would no longer be issuing patents on batteries, automobile tires, golf balls, golf clubs, 
and treatments for a variety of human diseases, such as cancer, just to name a few particular examples, 
because the utility of each of these compositions is applicable to the broad class in which each of these 
compositions falls: all batteries have the same utility, specifically to provide electrical power; all automobile 
tires have the same utility, specifically for use on automobiles; all golf balls and golf clubs have the same 
utility, specifically for use in the game of golf; and all cancer treatments have the same utility, specifically, 
to treat cancer. However, only the briefest perusal of virtually any issue of the Official Gazette provides 
numerous examples of patents being granted on each of the above compositions nearly every week. 
Furthermore, if a composition needed to be unique to be patented, the entire class and subclass system 
would be an exercise in futility, as the class and subclass system serves solely to group such common 
inventions, which would not be required if each invention needed to have a unique utility. In view of the 
above standards and "common sense" analysis, there can be little question that the present sequence clearly 
meets the requirements of 35 U.S.C. § 101. 

Furthermore, as the presently described polymorphisms are a part of the family of polymorphisms 
that have a well established utility, the Federal Circuit's holding in In re Brana (34 USPQ2d 1436 (Fed. 
Cir. 1995); "Brana") is directly on point. In Brana, the Federal Circuit admonished the Patent and 
Trademark Office for confusing "the requirements under the law for obtaining a patent with the requirements 
for obtaining government approval to market a particular drug for human consumption". Brana at 1442. 
The Federal Circuit went on to state: 

At issue in this case is an important question of the legal constraints on patent office 
examination practice and policy. The question is, with regard to pharmaceutical inventions, 
what must the applicant provide regarding the practical utility or usefulness of the invention 
for which patent protection is sought. This is not a new issue: it is one which we would 
have thought had been settled by case law years ago. 

Brana at 1439, emphasis added. The choice of the phrase "utility or usefulness" in the foregoing quotation 
is highly pertinent. The Federal Circuit is evidently using "utility" to refer to rejections under 
35 U.S.C. § 101, and is using "usefulness" to refer to rejections under 35 U.S.C. § 112, first paragraph. 
This is made evident in the continuing text in Brana, which explains the correlation between 35 U.S.C. 
§§ 101 and 112, first paragraph. The Federal Circuit concluded: 

FDA approval, however, is not a prerequisite for finding a compound useful within the 
meaning of the patent laws. Usefulness in patent law, and in particular in the context of 
pharmaceutical inventions, necessarily includes the expectation of further research and 
development. The stage at which an invention in this field becomes useful is well before 
it is ready to be administered to humans. Were we to require Phase II testing in order to 
prove utility, the associated costs would prevent many companies from obtaining patent 
protection on promising new inventions, thereby eliminating an incentive to pursue, through 
research and development, potential cures in many crucial areas such as the treatment of 
cancer. 

Brana at 1442-1443, citations omitted, emphasis added. As set forth above, the present polymorphisms 
are useful in forensic analysis as described in the specification as originally filed, without the need for any 
further research. As discussed above, even if the use of these polymorphic markers provided additional 
information on the percentage of particular subpopulations that contain these polymorphic markers, this 
would not mean that "additional research" is needed in order for these markers as they are presently 
described in the instant specification to be of use to forensic science. As stated above, using the 
polymorphic marker as described in the specification as originally filed can definitely distinguish members 
of a population from one another. However, even if, arguendo, further research might be required in 
certain aspects of the present invention, this does not preclude a finding that the invention has utility, as set 
forth by the Federal Circuit's holding in Brana, which clearly states, as highlighted in the quote above, that 
"pharmaceutical inventions, necessarily includes the expectation of further research and development" 
(Brana at 1442-1443, emphasis added). In assessing the question of whether undue experimentation 
would be required in order to practice the claimed invention, the key term is "undue", not 
"experimentation". In re Angstadt and Griffin, 190 USPQ 214 (CCPA 1976). The need for some 
experimentation does not render the claimed invention unpatentable. Indeed, a considerable amount of 
experimentation may be permissible if such experimentation is routinely practiced in the art. In re Angstadt 
and Griffin, supra; Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd., 18 USPQ2d 1016 (Fed. Cir. 
1991). Again, as a matter of law, it is well settled that a patent need not disclose what is well known in the 
art (In re Wands, supra). 

The Examiner further stated that "Applicants have not identified any particular reason for use of this 
particular polymorphism in forensic analysis or any particular benefit that would derive from analysis of this 
polymorphism" (the Final Action at page 2). As clearly set forth above, Appellants respectfully point out 
that the presently described polymorphisms are useful in forensic analysis for the same reason that any 
marker is useful in forensic analysis - specifically, to identify individual members of the human 
population based on the presence or absence of the described polymorphism. Using the polymorphic 
markers as described in the specification as originally filed can distinguish members of a population from 
one another. In the worst case scenario, each of these markers is useful to distinguish 50% of the 
population (in other words, the marker being present in half of the population). The ability to eliminate 50% 
of the population from a forensic analysis clearly is a real world, practical utility. Thus, the Examiner's 
argument does not support the alleged lack of utility. 

Importantly, it has been clearly established that a statement of utility in a specification must be 
accepted absent reasons why one skilled in the art would have reason to doubt the objective truth of such 
statement. In re Langer, 503 F.2d 1380, 1391, 183 USPQ 288, 297 (CCPA 1974) ("Langer"); In re 
Marzocchi, 439 F.2d 220, 224, 169 USPQ 367, 370 (CCPA 1971). As set forth in Langer: 

As a matter of Patent Office practice, a specification which contains a disclosure of utility 
which corresponds in scope to the subject matter sought to be patented must be taken as 
sufficient to satisfy the utility requirement of § 101 for the entire claimed subject matter 
unless there is a reason for one skilled in the art to question the objective truth of the 
statement of utility or its scope. 

Langer at 297, emphasis in original. As set forth in the MPEP, "Office personnel must provide evidence 
sufficient to show that the statement of asserted utility would be considered 'false' by a person of ordinary 
skill in the art" (MPEP, Eighth Edition at 2100-40, emphasis added). Thus, absent such evidence from the 
Examiner concerning the use of the presently described polymorphisms in forensic analysis, the present 
claims clearly meet the requirements of 35 U.S.C. § 101. 

Although Appellants need only make one credible assertion of utility to meet the requirements of 
35 U.S.C. § 101 (Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); In re Gottlieb, 140 USPQ 665 
(CCPA 1964); In re Malachowski, 189 USPQ 432 (CCPA 1976); Hoffman v. Klaus, 9 USPQ2d 1657 
(Bd. Pat. App. & Inter. 1988)), as set forth by Appellants in the Response to the First Action and the 
Response to the Final Action, the present sequence has a number of patentable utilities, among them, as 
detailed in the specification as originally filed, on page 3, lines 7-10, in "the identification of protein coding 
sequence". This is evidenced by the fact that SEQ ID NO:1 can be used to map the 29 coding exons on 
chromosome 9 (present within GenBank Accession Numbers AL591423, AL353895, AL449963, and 
AL158150, which are four overlapping clones from human chromosome 9; alignments and the first page 
from the GenBank records are shown in Exhibit A). The specification details, at page 3, lines 10-13, that 
the present sequence "identify biologically verified exon splice junctions, as opposed to splice junctions that 
may have been bioinformatically predicted from genomic sequence alone". It is well known that intron/exon 
boundaries are mutational hot spots, and thus the identification of the actual splice sites is of great utility to 
the skilled artisan. The specification details, at page 12, lines 5-11, that "sequences derived from regions 
adjacent to the intron/exon boundaries of the human gene can be used to design primers for use in 
amplification assays to detect mutations within the exons, introns, splice sites (e.g., splice acceptor and/or 
donor sites), etc., that can be used in diagnostics and pharmacogenomics". Appellants respectfully submit 
that the practical scientific value of biologically validated, expressed, spliced, and polyadenylated mRNA 
sequences is readily apparent to those skilled in the relevant biological and biochemical arts. Thus, the 
present claims clearly meet the requirements of 35 U.S.C. § 101. 

As yet a further example of the utility of the presently claimed polynucleotides, as described in the 
specification at least at page 3, lines 7-8, the present nucleotide sequence has a specific utility in mapping 
the protein encoding regions of the corresponding human chromosome, specifically chromosome 9, as 
described in the specification at least on page 3, lines 8-10. This is evidenced by the fact that SEQ ID 
NO: 1 can be used to map the 29 coding exons on chromosome 9, as detailed above (Exhibit A). Clearly, 
the present polynucleotide provides exquisite specificity in localizing the specific region of human 
chromosome 9 that contains the gene encoding the given polynucleotide, a utility not shared by virtually any 
other nucleic acid sequences. In fact, it is this specificity that makes this particular sequence so useful. Early 
gene mapping techniques relied on methods such as Giemsa staining to identify regions of chromosomes. 
However, such techniques produced genetic maps with a resolution of only 5 to 10 megabases, far too low 
to be of much help in identifying specific genes involved in disease. The skilled artisan readily appreciates 
the significant benefit afforded by markers that map a specific locus of the human genome, such as the 
present nucleic acid sequence. For further evidence in support of the Appellants' position, the Board is 
invited to review, for example, section 3 of Venter et al. (2001, Science 291:1304, at pp. 1317-1321, 
including Fig. 11 at pp. 1324-1325; Exhibit B), which demonstrates the significance of expressed sequence 
information in the structural analysis of genomic data. The presently claimed polynucleotide sequence 
defines a biologically validated sequence that provides a unique and specific resource for mapping the 
genome essentially as described in the Venter et al. article. 

Appellants respectfully remind the Board that only a minor percentage (2-4%) of the genome 
actually encodes exons, which in turn encode amino acid sequences. The presently claimed polynucleotide 
sequence provides biologically validated empirical data (e.g., showing which sequences are transcribed, 
spliced, and polyadenylated) that specifically define that portion of the corresponding genomic locus that 
actually encodes exon sequence, as described above. Equally significant is that the claimed polynucleotide 
sequence defines how the encoded exons are actually spliced together to produce an active transcript (i.e., 
the described sequences are useful for functionally defining exon splice-junctions). Thus, the present claims 
clearly meet the requirements of 35 U.S.C. § 101. 

The Final Action questioned these asserted utilities, stating that "applicants have not identified any 
particular reason for using this polynucleotide in mapping chromosome 9" (the Final Action bridging 
pages 3 and 4). The Examiner once again seems to be confusing the requirements of a specific utility with 
a unique utility. The fact that a small number of other nucleotide sequences could be used to map the 
protein coding regions in this specific region of chromosome 9 does not mean that the use of Appellants' 
sequence to map the protein coding regions of chromosome 9 is not a specific utility (Carl Zeiss Stiftung 
v. Renishaw PLC, supra). 

Furthermore, as set forth in the Response to the First Action and the Response to the Final Action, 
the present invention has a number of additional substantial and credible utilities, not the least of which is, 
as described in the specification on page 6, lines 16-18, that the present nucleotide sequences have utility 
in assessing gene expression patterns using high-throughput DNA chips. Such "DNA chips" clearly have 
utility, as evidenced by hundreds of issued U.S. Patents, as exemplified by U.S. Patent Nos. 5,445,934 
(Exhibit C), 5,556,752 (Exhibit D), 5,744,305 (Exhibit E), 5,837,832 (Exhibit F), 6,156,501 
(Exhibit G) and 6,261,776 (Exhibit H). As the present sequences are specific markers of the human 
genome (see above), and such specific markers are targets for the discovery of drugs that are associated 
with human disease, those of skill in the art would instantly recognize that the present nucleotide sequences 
would be an ideal, novel candidate for assessing gene expression using such DNA chips. Clearly, 
compositions that enhance the utility of such DNA chips, such as the presently claimed nucleotide 
sequences, must in themselves be useful. 

The Final Action also questioned this utility, stating that "Applicants have also not identified any 
particular reason for use of this particular polynucleotide in 'DNA chips'" (the Final Action at page 2). 
First, Appellants point out that nucleic acid sequences are commonly used in gene chip applications without 
any information regarding the function of the encoded protein, or even evidence regarding whether the 
sequence is actually even expressed. Thus, the present sequence, which has been biologically validated 
to be expressed, has a much greater utility than sequences that are merely predicted to be expressed based 
on bioinformatic analysis. Second, Appellants point out that nucleic acid sequences such as SEQ ID NO: 1 
are routinely used by companies throughout the biotechnology sector exactly as they are presented in the 
Sequence Listing, without any further experimentation. Expression profiling does not require a knowledge 
of the function of the particular nucleic acid on the chip - rather the gene chip indicates which DNA 
fragments are expressed at greater or lesser levels in two or more particular tissue types. Furthermore, 
although further information regarding the biological activity of a particular nucleic acid sequence might 
make it even more useful in gene chip applications, this does not mean that the use of the presently claimed 
nucleic acid sequence in gene chip applications is not a specific utility (Carl Zeiss Stiftung v. Renishaw 
PLC, supra). 

Evidence of the "real world" substantial utility of the present invention is further provided by the fact 
that there is an entire industry established based on the use of gene sequences or fragments thereof in a 
gene chip format. Perhaps the most notable gene chip company is Affymetrix. However, there are many 
companies that have, at one time or another, concentrated on the use of gene sequences or fragments, in 
gene chip and non-gene chip formats, for example: Gene Logic, ABI-Perkin-Elmer, HySeq and Incyte. 
In addition, one such company (Rosetta Inpharmatics) was viewed to have such "real world" value that it 
was acquired by a large pharmaceutical company (Merck) for significant sums of money (net equity value 
of the transaction was $620 million). The "real world" substantial industrial utility of gene sequences or 
fragments would, therefore, appear to be widespread and well established. Clearly, there can be no doubt 
that the skilled artisan would know how to use the presently claimed sequences (see Section VIII(B), 
below), strongly arguing that the claimed sequences have utility. Given the widespread utility of such "gene 
chip" methods using public domain gene sequence information (often with no indication of the biological 
function of the encoded protein), there can be little doubt that the use of the presently described novel 
sequences would have great utility in such DNA chip applications. Thus, the present claims clearly meet 
the requirements of 35 U.S.C. § 101. 

Persons of skill in the art, as well as venture capitalists and investors, readily recognize the utility, 
both scientific and commercial, of genomic data in general, and specifically human genomic data. Billions 
of dollars have been invested in the human genome project, resulting in useful genomic data (see, e.g., 
Venter et al., supra; Exhibit B). The results have been a stunning success as the utility of human genomic 
data has been widely recognized as a great gift to humanity (see, e.g., Jasny and Kennedy, 2001, Science 
291:1153; Exhibit I). Clearly, the usefulness of human genomic data, such as the presently claimed nucleic 
acid molecule, is substantial and credible (worthy of billions of dollars and the creation of numerous 
companies focused on such information) and well-established (the utility of human genomic information has 
been clearly understood for many years). 

Additionally, Appellants pointed out in the Response to the Final Action that two sequences sharing 
nearly 100% identity at the protein level over an extended region of the claimed sequence are 
present in the leading scientific repository for biological sequence data (GenBank), and have been 
annotated by third party scientists wholly unaffiliated with Appellants as "Homo sapiens 
ADAMTS-like 1" variants 1 and 2 (GenBank accession numbers NM_139238 and NM_052866; 
alignments and GenBank reports are shown in Exhibit J). In the specification as originally filed, Appellants 
noted the similarity of the present sequence to "matrix metalloprotease" (specification at page 2, lines 7-8), 
and particularly "the ADAMTS family of metalloproteases" (specification at page 17, lines 31-32). 
Furthermore, the scientists who described ADAMTS-like 1 have determined that the protein is localized 
to the extracellular matrix (Hirohata et al., J. Biol. Chem. 277:12182-12189, 2002; Exhibit K). 
Appellants respectfully point out that the legal test for utility simply involves an assessment of whether those 
skilled in the art would find any of the utilities described for the invention to be believable. Given these two 
GenBank annotations and the manuscript by Hirohata et al., there can be no question that those skilled in 
the art would clearly believe that Appellants' sequence is an ADAMTS-like protease, and would thus 
readily understand the utility of the presently claimed sequence, as described above, particularly in gene 
chip applications. As this is the standard for meeting the utility requirement of 35 U.S.C. § 101, Appellants 
submit that the present claims must clearly meet the requirements of 35 U.S.C. § 101. 

Regarding the utility requirements under 35 U.S.C. § 101, the Federal Circuit has clearly 
stated "(t)he threshold of utility is not high: An invention is 'useful' under section 101 if it is capable of 
providing some identifiable benefit." Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d 1364, 51 USPQ2d 
1700 (Fed. Cir. 1999) (citing Brenner v. Manson, 383 U.S. 519, 534 (1966)). Additionally, the Federal 
Circuit has stated that "(t)o violate § 101 the claimed device must be totally incapable of achieving a useful 
result." Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571, 24 USPQ2d 1401 
(Fed. Cir. 1992), emphasis added. Cross v. Iizuka (753 F.2d 1040, 224 USPQ 739 (Fed. Cir. 1985); 
"Cross") states "any utility of the claimed compounds is sufficient to satisfy 35 U.S.C. § 101". Cross at 
748, emphasis added. Indeed, the Federal Circuit recently emphatically confirmed that "anything under 
the sun that is made by man" is patentable (State Street Bank & Trust Co. v. Signature Financial Group 
Inc., 149 F.3d 1368, 47 USPQ2d 1596, 1600 (Fed. Cir. 1998), citing the U.S. Supreme Court's decision 
in Diamond v. Chakrabarty, 447 U.S. 303, 206 USPQ 193 (U.S., 1980)). 

Finally, while Appellants are well aware of the new Utility Guidelines set forth by the USPTO, 
Appellants respectfully point out that the current rules and regulations regarding the examination of patent 
applications are and always have been the patent laws as set forth in 35 U.S.C. and the patent rules as set 
forth in 37 C.F.R., not the Manual of Patent Examining Procedure or particular guidelines for patent 
examination set forth by the USPTO. Furthermore, it is the job of the judiciary, not the USPTO, to 
interpret these laws and rules. Appellants are unaware of any significant recent changes in either 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal Circuit 
that are in keeping with the new Utility Guidelines set forth by the USPTO. This is underscored by numerous 
patents that have been issued over the years that claim nucleic acid fragments that do not comply with the 
new Utility Guidelines. As examples of such issued U.S. Patents, the Board is invited to review U.S. Patent 
Nos. 5,817,479 (Exhibit L), 5,654,173 (Exhibit M), and 5,552,281 (Exhibit N; each of which claims 
short polynucleotides), and recently issued U.S. Patent No. 6,340,583 (Exhibit O; which includes no 
working examples), none of which contain examples of the "real-world" utilities that the Examiner seems 
to be requiring. As issued U.S. Patents are presumed to meet all of the requirements for patentability, 
including 35 U.S.C. §§ 101 and 112, first paragraph (see Section VIII(B), below), Appellants submit that 
the present polynucleotides must also meet the requirements of 35 U.S.C. § 101. While Appellants agree 
that each application is examined on its own merits, Appellants are unaware of any changes to 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal Circuit, 
since the issuance of these patents that render the subject matter claimed in these patents, which is similar 
to the subject matter in question in the present application, suddenly non-statutory or failing to meet the 
requirements of 35 U.S.C. § 101. Thus, holding Appellants to a different standard of utility would be 
arbitrary and capricious, and, like other clear violations of due process, cannot stand. 

For each of the foregoing reasons, Appellants submit that the rejection of claims 1-4 under 
35 U.S.C. § 101 must be overruled. 

B. Are Claims 1-4 Unusable Due to a Lack of Patentable Utility? 

The Final Action next rejects claims 1-4 under 35 U.S.C. § 112, first paragraph, since allegedly 
one skilled in the art would not know how to use the invention, as the invention allegedly is not supported 
by either a clear asserted utility or a well-established utility. 

The arguments detailed above in Section VIII(A) concerning the utility of the presently claimed 
sequences are incorporated herein by reference. As the Federal Circuit and its predecessor have 
determined that the utility requirement of Section 101 and the how-to-use requirement of Section 112, first 
paragraph, have the same basis, specifically the disclosure of a credible utility (In re Brana, supra; In re 
Jolles, 628 F.2d 1322, 1326 n.11, 206 USPQ 885, 889 n.11 (CCPA 1980); In re Fouche, 439 F.2d 
1237, 1243, 169 USPQ 429, 434 (CCPA 1971)), Appellants submit that as claims 1-4 have been shown 
to have "a specific, substantial, and credible utility", as detailed in Section VIII(A) above, the present 
rejection of claims 1-4 under 35 U.S.C. § 112, first paragraph, cannot stand. 

Appellants therefore submit that the rejection of claims 1-4 under 35 U.S.C. § 112, first paragraph, 
must be overruled. 

IX. APPENDIX 

The claims involved in this appeal are as follows: 

1. (Original) An isolated nucleic acid molecule comprising the nucleotide sequence of SEQ ID NO:1. 

2. (Original) An isolated nucleic acid molecule comprising a nucleotide sequence encoding the 
amino acid sequence shown in SEQ ID NO:2. 

3. (Original) An isolated expression vector comprising the nucleotide sequence of SEQ ID NO:1. 

4. (Previously Presented) A host cell comprising the expression vector of claim 3. 

X. CONCLUSION 

Appellants respectfully submit that, in light of the foregoing arguments, the Final Action's conclusion 
that claims 1-4 lack a patentable utility and are unusable by the skilled artisan due to a lack of patentable 
utility is unwarranted. It is therefore requested that the Board overturn the Final Action's rejections. 



Respectfully submitted, 



October 14, 2003 

Date David W. Hibler Reg. No. 41,071 

tccagactcgcaggcagaggaagctgcacttcgtggtg...

cagcggcccagcagctctcagcctcggaggtggtcacccacctgggg...
Sbjct: 131633 giaccUUcgtgaccagccgccggcccccacagct.cctgaagtcctgcaatttggatcc 131692



Query: 1995   ctgcccagcaaggt 2008
              ||||||||||||||
Sbjct: 131693 ctgcccagcaaggt 131706



Score = 252 bits (127), Expect = 2e-63 
Identities = 127/127 (100%) 
Strand = Plus / Plus 



Query: 475   attgttggctgcgatcaccagctgggaagcaccgtcaaggaagataactgt... 534
Sbjct: 32280 ... 32339

Query: 535   aacggagatgggtccacctgccggctggtccgagggcagtataaatcccagctctccgca 594
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 32340 aacggagatgggtccacctgccggctggtccgagggcagtataaatcccagctctccgca 32399

Query: 595   accaaat 601
             |||||||
Sbjct: 32400 accaaat 32406



Score = 230 bits (116), Expect = 7e-57 
Identities = 116/116 (100%) 
Strand = Plus / Plus 



Query: 833   ...tcaacccatca 892
Sbjct: 67675 agattcgtaac...cccatca 67734

Query: 893   tccaccgatggagggagacggatttctttccttgctcagcaacctgtggaggaggt 948
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 67735 tccaccgatggagggagacggatttctttccttgctcagcaacctgtggaggaggt 67790



Score = 176 bits (89), Expect = 9e-41 
Identities = 89/89 (100%) 
Strand = Plus / Plus 



Query: 1488  ... 1547
Sbjct: 94753 ... 94812

Query: 1548  aggagctgctgtgtcagaggagccctcgt 1576
             |||||||||||||||||||||||||||||
Sbjct: 94813 aggagctgctgtgtcagaggagccctcgt 94841



Score = 149 bits (75), Expect = 2e-32 
Identities = 75/75 (100%) 
Strand = Plus / Plus 



Query: 602   cggatgatactgtggttgcaattccctatggaagtagacatattcgccttgtcttaaaag 661
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 45975 cggatgatactgtggttgcaattccctatggaagtagacatattcgccttgtcttaaaag 46034

Query: 662   gtcctgatcacttat 676
             |||||||||||||||
Sbjct: 46035 gtcctgatcacttat 46049



Score = 111 bits (56), Expect = 5e-21 
Identities = 56/56 (100%) 
Strand = Plus / Plus 

Query: 1083  cagtgacggatacaagcagatcatgccttatgacctctaccatccccttcctcggt 1138
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 85891 cagtgacggatacaagcagatcatgccttatgacctctaccatccccttcctcggt 85946



>AL449963 ACCESSION: AL449963 NID: gi 20387012 emb AL449963.2 HS399M15 Homo 
sapiens chromosome 9 BAC RP11-399M15, complete sequence 
Length = 213216 

Score = 472 bits (238), Expect = e-12? 
Identities = 238/238 (100%) 
Strand = Plus / Plus 



Query: 237    ggactgcccaccagaagcaggtgatttccgagctcagcaatgctcagctcataatgatgt 296
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 104804 ggactgcccaccagaagcaggtgatttccgagctcagcaatgctcagctcataatgatgt 104863

Query: 297    caagcaccatggccagttttatgaatggcttcctgtgtctaatgaccctgacaacccatg 356
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 104864 caagcaccatggccagttttatgaatggcttcctgtgtctaatgaccctgacaacccatg 104923

Query: 357    ttcactcaagtgccaagccaaaggaacaaccctggttgttgaactagcacctaaggtctt 416
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 104924 ttcactcaagtgccaagccaaaggaacaaccctggttgttgaactagcacctaaggtctt 104983

Query: 417    agatggtacgcgttgctatacagaatctttggatatgtgcatcagtggtttatgccaa 474
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 104984 agatggtacgcgttgctatacagaatctttggatatgtgcatcagtggtttatgccaa 105041



Score = 408 bits (206), Expect = e-110 
Identities = 206/206 (100%) 
Strand = Plus / Plus 



Query: 1136   ggtgggaggccaccccatggaccgcgtgctcctcctcgtgtggggggggcatccagagcc 1195
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 211086 ggtgggaggccaccccatggaccgcgtgctcctcctcgtgtggggggggcatccagagcc 211145

Query: 1196   gggcagtttcctgtgtggaggaggacatccaggggcatgtcacttcagtggaagagtgga 1255
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 211146 gggcagtttcctgtgtggaggaggacatccaggggcatgtcacttcagtggaagagtgga 211205

Query: 1256   aatgcatgtacacccctaagatgcccatcgcgcagccctgcaacatttttgactgcccta 1315
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 211206 aatgcatgtacacccctaagatgcccatcgcgcagccctgcaacatttttgactgcccta 211265

Query: 1316   aatggctggcacaggagtggtctccg 1341
              ||||||||||||||||||||||||||
Sbjct: 211266 aatggctggcacaggagtggtctccg 211291



Score = 313 bits (158), Expect = 2e-81 
Identities = 158/158 (100%) 
Strand = Plus / Plus 



Query: 677    atctggaaaccaaaaccctccaggggactaaaggtgaaaacagtctcagctccacaggaa 736
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 170029 atctggaaaccaaaaccctccaggggactaaaggtgaaaacagtctcagctccacaggaa 170088

Query: 737    ctttccttgtggacaattctagtgtggacttccagaaatttccagacaaagagatactga 796
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 170089 ctttccttgtggacaattctagtgtggacttccagaaatttccagacaaagagatactga 170148

Query: 797    gaatggctggaccactcacagcagatttcattgtcaag 834
              ||||||||||||||||||||||||||||||||||||||
Sbjct: 170149 gaatggctggaccactcacagcagatttcattgtcaag 170186



Score = 295 bits (149) , Expect = 4e-76 
Identities = 149/149 (100%) 
Strand = Plus / Plus 



Query: 1341   gtgcacagtgacatgtggccagggcctcagataccgtgtggtcctctgcatcgaccatcg 1400
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 212586 gtgcacagtgacatgtggccagggcctcagataccgtgtggtcctctgcatcgaccatcg 212645

Query: 1401   aggaatgcacacaggaggctgtagcccaaaaacaaagccccacataaaagaggaatgcat 1460
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 212646 aggaatgcacacaggaggctgtagcccaaaaacaaagccccacataaaagaggaatgcat 212705

Query: 1461   cgtacccactccctgctataaacccaaag 1489
              |||||||||||||||||||||||||||||
Sbjct: 212706 cgtacccactccctgctataaacccaaag 212734



Score = 280 bits (141), Expect = 2e-71 
Identities = 141/141 (100%) 
Strand = Plus / Plus 

Query: 945    aggttatcagctgacatcggctgagtgctacgatctgaggagcaaccgtgtggttgctga 1004
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 192708 aggttatcagctgacatcggctgagtgctacgatctgaggagcaaccgtgtggttgctga 192767

Query: 1005   ccaatactgtcactattacccagagaacatcaaacccaaacccaagcttcaggagtgcaa 1064
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 192768 ccaatactgtcactattacccagagaacatcaaacccaaacccaagcttcaggagtgcaa 192827

Query: 1065   cttggatccttgtccagccag 1085
              |||||||||||||||||||||
Sbjct: 192828 cttggatccttgtccagccag 192848

Score = 258 bits (130), Expect = 8e-65 
Identities =130/130 (100%) 
Strand = Plus / Plus 

Query: 63    ... 122
Sbjct: 35606 ... 35665

Query: 123   atggagtgaatgctcacgcacctgcgggggtggggcctcctactctc... 182
Sbjct: 35666 ... 35725

Query: 183   gagcagcaag 192
             ||||||||||
Sbjct: 35726 gagcagcaag 35735



Score = 252 bits (127), Expect = 5e-63 
Identities = 127/127 (100%) 
Strand = Plus / Plus 

Query: 475    attgttggctgcgatcaccagctgggaagcaccgtcaaggaagataactgt... 534
Sbjct: 153018 ... 153077

Query: 535    aacggagatgggtccacctgccggctggtccgagggcagtataaatcccagctctccgca 594
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 153078 aacggagatgggtccacctgccggctggtccgagggcagtataaatcccagctctccgca 153137

Query: 595    accaaat 601
              |||||||
Sbjct: 153138 accaaat 153144



Score = 230 bits (116), Expect = 2e-56 
Identities = 116/116 (100%) 
Strand = Plus / Plus 

Query: 833    ... 892
Sbjct: 188412 ... 188471

Query: 893    ... 948
Sbjct: 188472 ... 188527



Score = 149 bits (75), Expect = 5e-32 
Identities = 75/75 (100%) 
Strand = Plus / Plus 



Query: 602    cggatgatactgtggttgcaattccctatggaagtagacatattcgccttgtcttaaaag 661
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 166718 cggatgatactgtggttgcaattccctatggaagtagacatattcgccttgtcttaaaag 166777

Query: 662    gtcctgatcacttat 676
              |||||||||||||||
Sbjct: 166778 gtcctgatcacttat 166792

Score = 125 bits (63), Expect = 8e-25 
Identities = 63/63 (100%) 
Strand = Plus / Plus 

Query: 1    atggaatgctgccgtcgggcaactcctggcacactgctcctctttctggctttcctgctc 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 5010 atggaatgctgccgtcgggcaactcctggcacactgctcctctttctggctttcctgctc 5069

Query: 61   ctg 63
            |||
Sbjct: 5070 ctg 5072



Score = 111 bits (56), Expect = 1e-20 
Identities = 56/56 (100%) 
Strand = Plus / Plus 



Query: 1083   cagtgacggatacaagcagatcatgccttatgacctctaccatccccttcctcggt 1138
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 206629 cagtgacggatacaagcagatcatgccttatgacctctaccatccccttcctcggt 206684



Score = 93.7 bits (47), Expect = 3e-15 
Identities = 47/47 (100%) 
Strand = Plus / Plus 



Query: 192   gagctgtgaaggaagaaatatccgatacagaacatgcagtaatgtgg 238
             |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 64022 gagctgtgaaggaagaaatatccgatacagaacatgcagtaatgtgg 64068



>AL158150.14.1.168011 

Length = 168011 

Score = 442 bits (223), Expect = e-120 
Identities = 223/223 (100%) 
Strand = Plus / Plus 



Query: 4960   aggcctgtgagcacccagaactgctggtcagaggcctgcagtgtacactggagagt... 5019
Sbjct: 103373 ... 103432

Query: 5020   ctgtggaccctgtgcacagctacctgtggcaactacggcttccagtcccggcgtgtggag 5079
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 103433 ctgtggaccctgtgcacagctacctgtggcaactacggcttccagtcccggcgtgtggag 103492

Query: 5080   tgtgtgcatgcccgcaccaacaaggcagtgcctgagcacctgtgctcctgggggccccgg 5139
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 103493 tgtgtgcatgcccgcaccaacaaggcagtgcctgagcacctgtgctcctgggggccccgg 103552

Query: 5140   cctgccaactggcagcgctgcaacatcaccccatgtgaaaaca 5182
              |||||||||||||||||||||||||||||||||||||||||||
Sbjct: 103553 cctgccaactggcagcgctgcaacatcaccccatgtgaaaaca 103595



Score = 424 bits (214), Expect = e-115 
Identities = 214/214 (100%) 
Strand = Plus / Plus 



Query: 4249  ggctgccccatcaaaggfecaccctgtccctaatatcacctggtttcatggtgg... 4308
Sbjct: 84513 ... 84572

Query: 4309  attgtcactgccacaggactgacgcatcacatct... 4368
Sbjct: 84573 ... 84632

Query: 4369  gcaaaccttagcggtgggtctcaaggggaattcagctgccttgctca... 4428
Sbjct: 84633 ... 84692

Query: 4429  gtgctcatgcagaaggcatctttagtgatccaag 4462
             ||||||||||||||||||||||||||||||||||
Sbjct: 84693 gtgctcatgcagaaggcatctttagtgatccaag 84726



Score = 414 bits (209) , Expect = e-112 
Identities = 209/209 (100%) 
Strand = Plus / Plus 



Query: 4643  ggtggatggtgacctcctggtctgcctgtacccggagctgtgggggaggtgtccagaccc 4702
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 89071 ggtggatggtgacctcctggtctgcctgtacccggagctgtgggggaggtgtccagaccc 89130

Query: 4703  gcagggtgacctgtcaaaagctgaaagcctctgggatctccacccctgtgtccaatgaca 4762
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 89131 gcagggtgacctgtcaaaagctgaaagcctctgggatctccacccctgtgtccaatgaca 89190

Query: 4763  tgtgcacccaggtcgccaagcggcctgtggacacccaggcctgtaaccagcagctgtgtg 4822
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 89191 tgtgcacccaggtcgccaagcggcctgtggacacccaggcctgtaaccagcagctgtgtg 89250

Query: 4823  tggagtgggccttctccagctggggccag 4851
             |||||||||||||||||||||||||||||
Sbjct: 89251 tggagtgggccttctccagctggggccag 89279



Score = 369 bits (186), Expect = 1e-98 
Identities = 186/186 (100%) 
Strand = Plus / Plus 

Query: 4461  agattactggtggtctgtggacagactggcaacctgctcagcctcctgtggtaaccgggg 4520
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 86249 agattactggtggtctgtggacagactggcaacctgctcagcctcctgtggtaaccgggg 86308

Query: 4521  ggttcagcagccccgcttgaggtgcctgctgaacagcacggaggtcaaccctgcccactg 4580
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 86309 ggttcagcagccccgcttgaggtgcctgctgaacagcacggaggtcaaccctgcccactg 86368

Query: 4581  cgcagggaaggttcgccctgcggtgcagcccatcgcgtgcaaccggagagactgcccttc 4640
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 86369 cgcagggaaggttcgccctgcggtgcagcccatcgcgtgcaaccggagagactgcccttc 86428

Query: 4641  tcggtg 4646
             ||||||
Sbjct: 86429 tcggtg 86434



Score = 361 bits (182), Expect = 3e-96 
Identities = 182/182 (100%) 
Strand = Plus / Plus 



Query: 3933  aggagtgcctgaagctgaagtcacttggttcaggaataaaagcaaactgggctccccgca 3992
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 22965 aggagtgcctgaagctgaagtcacttggttcaggaataaaagcaaactgggctccccgca 23024

Query: 3993  ccatctgcacgaaggctccttgctgctcacaaacgtgtcctcctcggatcagggcctgta 4052
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 23025 ccatctgcacgaaggctccttgctgctcacaaacgtgtcctcctcggatcagggcctgta 23084

Query: 4053  ctcctgcagggcggccaatcttcatggagagctgactgagagcacccagctgctgatcct 4112
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 23085 ctcctgcagggcggccaatcttcatggagagctgactgagagcacccagctgctgatcct 23144

Query: 4113  ag 4114
             ||
Sbjct: 23145 ag 23146



Score = 274 bits (138), Expect = 5e-70 
Identities = 138/138 (100%) 
Strand = Plus / Plus 

Query: 4113  agatcccccccaagtccccacacagttggaagacatcagggccttgctcgctgccactgg 4172
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 26524 agatcccccccaagtccccacacagttggaagacatcagggccttgctcgctgccactgg 26583

Query: 4173  accgaaccttccttcagtgctgacgtctcctctgggaacacagctggtcctggatcctgg 4232
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 26584 accgaaccttccttcagtgctgacgtctcctctgggaacacagctggtcctggatcctgg 26643

Query: 4233  gaattctgctctccttgg 4250
             ||||||||||||||||||
Sbjct: 26644 gaattctgctctccttgg 26661



Score = 264 bits (133), Expect = 5e-67 
Identities = 133/133 (100%) 
Strand = Plus / Plus 



Query: 3803  caggaaagccactagtgaaaac.gtcacgaatgacagtgatcaaca... 3862
Sbjct: 13789 ... 13848

Query: 3863  tcacagtcgatataggaagcaccatcaaaacagtgcagggagtgaatgtgacaatcaact 3922
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 13849 tcacagtcgatataggaagcaccatcaaaacagtgcagggagtgaatgtgacaatcaact 13908

Query: 3923  gccaggttgcagg 3935
             |||||||||||||
Sbjct: 13909 gccaggttgcagg 13921



Score = 228 bits (115), Expect = 3e-56 
Identities = 115/115 (100%) 
Strand = Plus / Plus 

Query: 4848   ccagtgcaatgggccttgcatcgggcctcacctagctgtgcaacacagacaagtcttctg 4907
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 102461 ccagtgcaatgggccttgcatcgggcctcacctagctgtgcaacacagacaagtcttctg 102520

Query: 4908   ccagacacgggatggcatcaccttaccatcagagcagtgcagtgctcttccgagg 4962
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 102521 ccagacacgggatggcatcaccttaccatcagagcagtgcagtgctcttccgagg 102575



Score = 212 bits (107), Expect = 2e-51 
Identities = 107/107 (100%) 
Strand = Plus / Plus 

Query: 5183   tggagtgcagagacaccaccaggtactgcgagaaggtgaaacagctgaaactctgccaac 5242
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 105125 tggagtgcagagacaccaccaggtactgcgagaaggtgaaacagctgaaactctgccaac 105184

Query: 5243   tcagccagtttaaatctcgctgctgtggaacttgtggcaaagcgtga 5289
              |||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 105185 tcagccagtttaaatctcgctgctgtggaacttgtggcaaagcgtga 105231

LOCUS       AL591423   54193 bp    DNA     linear   PRI 16-NOV-2001
DEFINITION  Human DNA sequence from clone RP11-134P18 on chromosome 9, complete
            sequence.
ACCESSION   AL591423
VERSION     AL591423.6  GI:16973934
KEYWORDS    HTG.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 54193)
  AUTHORS   Almeida,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-NOV-2001) Wellcome Trust Sanger Institute, Hinxton,
            Cambridgeshire, CB10 1SA, UK. E-mail enquiries:
            humquery@sanger.ac.uk Clone requests: clonerequest@sanger.ac.uk
COMMENT     On Nov 17, 2001 this sequence version replaced gi:16214807.
            During sequence assembly data is compared from overlapping clones.
            Where differences are found these are annotated as variations
            together with a note of the overlapping clone name. Note that the
            variation annotation may not be found in the sequence submission
            corresponding to the overlapping clone, as we submit sequences with
            only a small overlap as described above.

            This sequence was finished as follows unless otherwise noted: all
            regions were either double-stranded or sequenced with an alternate
            chemistry or covered by high quality data (i.e., phred quality >=
            30); an attempt was made to resolve all sequencing problems, such
            as compressions and repeats; all regions were covered by at least
            one plasmid subclone or more than one M13 subclone; and the
            assembly was confirmed by restriction digest. The following
            abbreviations are used to associate primary accession numbers given
            in the feature table with their source databases: Em:, EMBL; Sw:,
            SWISSPROT; Tr:, TREMBL; Wp:, WORMPEP; Information on the WORMPEP
            database can be found at
            http://www.sanger.ac.uk/Projects/C_elegans/wormpep This sequence
            was generated from part of bacterial clone contigs of human
            chromosome 9, constructed by the Sanger Centre Chromosome 9 Mapping
            Group. Further information can be found at
            http://www.sanger.ac.uk/HGP/Chr9

            RP11-134P18 is from the library RPCI-11.1 constructed by the group
            of Pieter de Jong. For further details see
            http://www.chori.org/bacpac/home.htm
            VECTOR: pBACe3.6

            IMPORTANT: This sequence is not the entire insert of clone
            RP11-134P18 It may be shorter because we sequence overlapping
            sections only once, except for a short overlap.
            The true left end of clone RP11-220B22 is at 52194 in this

1: AL353895. Human DNA sequenc...[gi:13751339]

LOCUS       AL353895   163163 bp    DNA     linear   PRI 18-SEP-2001
DEFINITION  Human DNA sequence from clone RP11-503K16 on chromosome 9, complete
            sequence.
ACCESSION   AL353895
VERSION     AL353895.4  GI:13751339
KEYWORDS    HTG.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 163163)
  AUTHORS   Kimberley,A.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-SEP-2001) Sanger Centre, Hinxton, Cambridgeshire,
            CB10 1SA, UK. E-mail enquiries: humquery@sanger.ac.uk Clone
            requests: clonerequest@sanger.ac.uk
COMMENT     On Apr 21, 2001 this sequence version replaced gi:13396472.
            During sequence assembly data is compared from overlapping clones.
            Where differences are found these are annotated as variations
            together with a note of the overlapping clone name. Note that the
            variation annotation may not be found in the sequence submission
            corresponding to the overlapping clone, as we submit sequences with
            only a small overlap as described above.

            This sequence was finished as follows unless otherwise noted: all
            regions were either double-stranded or sequenced with an alternate
            chemistry or covered by high quality data (i.e., phred quality >=
            30); an attempt was made to resolve all sequencing problems, such
            as compressions and repeats; all regions were covered by at least
            one plasmid subclone or more than one M13 subclone; and the
            assembly was confirmed by restriction digest. The following
            abbreviations are used to associate primary accession numbers given
            in the feature table with their source databases: Em:, EMBL; Sw:,
            SWISSPROT; Tr:, TREMBL; Wp:, WORMPEP; Information on the WORMPEP
            database can be found at
            http://www.sanger.ac.uk/Projects/C_elegans/wormpep This sequence
            was generated from part of bacterial clone contigs of human
            chromosome 9, constructed by the Sanger Centre Chromosome 9 Mapping
            Group. Further information can be found at
            http://www.sanger.ac.uk/HGP/Chr9

            RP11-503K16 is from the library RPCI-11.2 constructed by the group
            of Pieter de Jong. For further details see
            http://www.chori.org/bacpac/home.htm
            VECTOR: pBACe3.6

            This sequence is the entire insert of clone RP11-503K16 The true
            left end of clone RP11-134P18 is at 92104 in this sequence. The
            true right end of clone RP11-399M15 is at 92480 in this sequence.
FEATURES             Location/Qualifiers

1: AL158150. Human DNA sequenc...[gi:14160905]

LOCUS       AL158150   168011 bp    DNA     linear   PRI 18-MAY-2001
DEFINITION  Human DNA sequence from clone RP11-220B22 on chromosome 9, complete
            sequence.
ACCESSION   AL158150
VERSION     AL158150.14  GI:14160905
KEYWORDS    HTG.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 168011)
  AUTHORS   Skuce,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-MAY-2001) Sanger Centre, Hinxton, Cambridgeshire,
            CB10 1SA, UK. E-mail enquiries: humquery@sanger.ac.uk Clone
            requests: clonerequest@sanger.ac.uk
COMMENT     On May 20, 2001 this sequence version replaced gi:13446402.
            During sequence assembly data is compared from overlapping clones.
            Where differences are found these are annotated as variations
            together with a note of the overlapping clone name. Note that the
            variation annotation may not be found in the sequence submission
            corresponding to the overlapping clone, as we submit sequences with
            only a small overlap as described above.

            This sequence was finished as follows unless otherwise noted: all
            regions were either double-stranded or sequenced with an alternate
            chemistry or covered by high quality data (i.e., phred quality >=
            30); an attempt was made to resolve all sequencing problems, such
            as compressions and repeats; all regions were covered by at least
            one plasmid subclone or more than one M13 subclone; and the
            assembly was confirmed by restriction digest. The following
            abbreviations are used to associate primary accession numbers given
            in the feature table with their source databases: Em:, EMBL; Sw:,
            SWISSPROT; Tr:, TREMBL; Wp:, WORMPEP; Information on the WORMPEP
            database can be found at
            http://www.sanger.ac.uk/Projects/C_elegans/wormpep This sequence
            was generated from part of bacterial clone contigs of human
            chromosome 9, constructed by the Sanger Centre Chromosome 9 Mapping
            Group. Further information can be found at
            http://www.sanger.ac.uk/HGP/Chr9

            RP11-220B22 is from the library RPCI-11.1 constructed by the group
            of Pieter de Jong. For further details see
            http://www.chori.org/bacpac/home.htm
            VECTOR: pBACe3.6

            This sequence is the entire insert of clone RP11-220B22 The true
            left end of clone RP11-296P7 is at 58728 in this sequence.
FEATURES             Location/Qualifiers
     source          1..168011
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D 1: AL442638. Homo sapiens chro...[gi: 18857863] 

LOCUS HS570H19 188247 bp DNA linear PRI 19-FEB-2002 

DEFINITION Homo sapiens chromosome 9 BAC RP11-570H19, complete sequence. 
ACCESSION AL442638 AL358947 
VERSION AL442638.3 GI:18857863 

KEYWORDS HTG.

SOURCE Homo sapiens (human) 

ORGANISM Homo sapiens 

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1

AUTHORS Plumb,B.

TITLE Direct Submission 

JOURNAL Submitted (24-AUG-2000) Sanger Centre, Hinxton, Cambridgeshire, 
CB10 1SA, UK. E-mail enquiries: humquery@sanger.ac.uk Clone
requests: clonerequest@sanger.ac.uk
REFERENCE 2 (bases 1 to 188247)

AUTHORS Scharfe,M., Conrad,A., Hornischer,K., Loehnert,T.H., Tnies,S. and
Bloecker,H. 

TITLE Direct Submission 

JOURNAL Submitted (28-SEP-2000) GBF, Dept. of Genome Analysis, Mascheroder
Weg 1, D-38124 Braunschweig, Germany, E-mail: info.genome@gbf.de
COMMENT On Feb 21, 2002 this sequence version replaced gi:11693452.

All annotations in this database entry are developed by
computational tools. It is therefore not explicitly noted in the 
feature lines that evidence is not experimental. 
Mapping was performed at The Sanger Centre 
(cf. http://www.sanger.ac.uk/HGP/Chr9)
Mapping information is available via

http://webace.sanger.ac.uk/cgi-bin/display?db=acedb9&grep=570H19

Genome Center

Center: GBF, Braunschweig 
Center code: GBF 

Web site: http://genome.gbf.de/ 
Contact: info.genome@gbf.de

Project Information

Center project name:
Center clone name: bA570H19

Summary Statistics

Sequencing vector: ###;

Chemistry: Dye-terminator-BigDye: 65% of reads
Chemistry: Dye-terminator-amersham: 31% of reads
Chemistry: Dye-primer-amersham: 4% of reads 
Assembly program: Phrap; version 0.990319 
Consensus quality: 0 bases at least Q40 
Consensus quality: 0 bases at least Q30 
Consensus quality: 0 bases at least Q20 
Estimated insert size: ##; agarose-fp estimation 



The Sequence of the Human Genome
THE HUMAN GENOME

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of
the human genome was generated by the whole-genome shotgun sequencing
method. The 14.8-billion bp DNA sequence was generated over 9 months from
27,271,853 high-quality sequence reads (5.11-fold coverage of the genome)
from both ends of plasmid clones made from the DNA of five individuals. Two
assembly strategies, a whole-genome assembly and a regional chromosome
assembly, were used, each combining sequence data from Celera and the
publicly funded genome effort. The public data were shredded into 550-bp
segments to create a 2.9-fold coverage of those genome regions that had been
sequenced, without including biases inherent in the cloning and assembly
procedure used by the publicly funded group. This brought the effective
coverage in the assemblies to eightfold, reducing the number and size of gaps in
the final assembly over what would be obtained with 5.11-fold coverage. The
two assembly strategies yielded very similar results that largely agree with
independent mapping data. The assemblies effectively cover the euchromatic
regions of the human chromosomes. More than 90% of the genome is in
scaffold assemblies of 100,000 bp or more, and 25% of the genome is in
scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed
26,588 protein-encoding transcripts for which there was strong corroborating
evidence and an additional ~12,000 computationally derived genes with mouse
matches or other weak supporting evidence. Although gene-dense clusters are
obvious, almost half the genes are dispersed in low G+C sequence separated
by large tracts of apparently noncoding sequence. Only 1.1% of the genome
is spanned by exons, whereas 24% is in introns, with 75% of the genome being
intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal
lengths, are abundant throughout the genome and reveal a complex
evolutionary history. Comparative genomic analysis indicates vertebrate expansions
of genes associated with neuronal function, with tissue-specific developmental
regulation, and with the hemostasis and immune systems. DNA
sequence comparisons between the consensus sequence and publicly funded
genome data provided locations of 2.1 million single-nucleotide polymorphisms
(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per
1250 on average, but there was marked heterogeneity in the level of polymorphism
across the genome. Less than 1% of all SNPs resulted in variation in
proteins, but the task of determining which SNPs have functional consequences
remains an open challenge.



Decoding of the DNA that constitutes the 
human genome has been widely anticipated 
for the contribution it will make toward understanding
human evolution, the causation
of disease, and the interplay between the 
environment and heredity in defining the hu- 
man condition. A project with the goal of 
determining the complete nucleotide sequence
of the human genome was first formally
proposed in 1985 (1). In subsequent
years, the idea met with mixed reactions in 
the scientific community (2). However, in 
1990, the Human Genome Project (HGP) was 
officially initiated in the United States under 
the direction of the National Institutes of 
Health and the U.S. Department of Energy 
with a 15-year, $3 billion plan for completing 
the genome sequence. In 1998 we announced 
our intention to build a unique genome- 
sequencing facility, to determine the se- 
quence of the human genome over a 3-year 
period. Here we report the penultimate mile- 
stone along the path toward that goal, a nearly 
complete sequence of the euchromatic por- 
tion of the human genome. The sequencing 
was performed by a whole-genome random 
shotgun method with subsequent assembly of 
the sequenced segments. 

The modern history of DNA sequencing
began in 1977, when Sanger reported his method
for determining the order of nucleotides of
DNA using chain-terminating nucleotide analogs
(3). In the same year, the first human gene
was isolated and sequenced (4). In 1986, Hood 
and co-workers (5) described an improvement 
in the Sanger sequencing method that included 
attaching fluorescent dyes to the nucleotides, 
which permitted them to be sequentially read 
by a computer. The first automated DNA se- 
quencer, developed by Applied Biosystems in 
California in 1987, was shown to be successful 
when the sequences of two genes were obtained 
with this new technology (6). From early se- 
quencing of human genomic regions (7), it 
became clear that cDNA sequences (which are 
reverse-transcribed from RNA) would be es- 
sential to annotate and validate gene predictions 
in the human genome. These studies were the 
basis in part for the development of the ex- 
pressed sequence tag (EST) method of gene 
identification (8), which is a random selection,
very high throughput sequencing approach to 
characterize cDNA libraries. The EST method 
led to the rapid discovery and mapping of human
genes (9). The increasing numbers of human
EST sequences necessitated the development
of new computer algorithms to analyze
large amounts of sequence data, and in 1993 at 
The Institute for Genomic Research (TIGR), an 
algorithm was developed that permitted assem- 
bly and analysis of hundreds of thousands of 
ESTs. This algorithm permitted characteriza- 
tion and annotation of human genes on the basis 
of 30,000 EST assemblies (10). 

The complete 49-kbp bacteriophage lamb- 
da genome sequence was determined by a 
shotgun restriction digest method in 1982 
(11). When considering methods for sequencing
the smallpox virus genome in 1991 (12),
a whole-genome shotgun sequencing method 
was discussed and subsequently rejected ow- 
ing to the lack of appropriate software tools 
for genome assembly. However, in 1994, 
when a microbial genome-sequencing project 
was contemplated at TIGR, a whole-genome 
shotgun sequencing approach was considered 
possible with the TIGR EST assembly algo- 
rithm. In 1995, the 1.8-Mbp Haemophilus 
influenzae genome was completed by a 
whole-genome shotgun sequencing method 
(13). The experience with several subsequent 
genome-sequencing efforts established the
broad applicability of this approach (14, 15).

A key feature of the sequencing approach 
used for these megabase-size and larger ge- 
nomes was the use of paired-end sequences 
(also called mate pairs), derived from sub- 
clone libraries with distinct insert sizes and 
cloning characteristics. Paired-end sequences 
are sequences 500 to 600 bp in length from 
both ends of double-stranded DNA clones of 
prescribed lengths. The success of using end 
sequences from long segments (18 to 20 kbp) 
of DNA cloned into bacteriophage lambda in 
assembly of the microbial genomes led to the 
suggestion (16) of an approach to simultaneously
map and sequence the human genome
by means of end sequences from 150-
kbp bacterial artificial chromosomes (BACs) 
(17, 18). The end sequences spanned by 
known distances provide long-range continu- 
ity across the genome. A modification of the 
BAC end-sequencing (BES) method was ap- 
plied successfully to complete chromosome 2 
from the Arabidopsis thaliana genome (19). 

In 1997, Weber and Myers (20) proposed 
whole-genome shotgun sequencing of the 
human genome. Their proposal was not well 
received (21). However, by early 1998, as 
less than 5% of the genome had been sequenced,
it was clear that the rate of progress
in human genome sequencing worldwide
was very slow (22), and the prospects for 
finishing the genome by the 2005 goal were 
uncertain. 

In early 1998, PE Biosystems (now Applied 
Biosystems) developed an automated, high- 
throughput capillary DNA sequencer, subse- 
quently called the ABI PRISM 3700 DNA 
Analyzer. Discussions between PE Biosystems 
and TIGR scientists resulted in a plan to undertake
the sequencing of the human genome with
the 3700 DNA Analyzer and the whole-genome 
shotgun sequencing techniques developed at 
TIGR (23). Many of the principles of operation 
of a genome-sequencing facility were estab- 
lished in the TIGR facility (24). However, the 
facility envisioned for Celera would have a 
capacity roughly 50 times that of TIGR, and 
thus new developments were required for sam- 
ple preparation and tracking and for whole- 
genome assembly. Some argued that the re- 
quired 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome assem- 
bly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the nu- 
cleotide sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was deter- 
mined over a 1-year period (26-28). The Dro- 
sophila genome-sequencing effort resulted in 
two key findings: (i) that the assembly algo- 
rithms could generate chromosome assemblies 
with highly accurate order and orientation with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic 
changes in the public genome effort subsequent 
to the formation of Celera (29), led to a modified
whole-genome shotgun sequencing approach
to the human genome. We initially proposed
to do 10-fold sequence coverage of the
genome over a 3-year period and to make interim
assembled sequence data available quarterly.
The modifications included a plan to perform
random shotgun sequencing to ~5-fold
coverage and to use the unordered and unoriented
BAC sequence fragments and subassemblies
published in GenBank by the publicly
funded genome effort (30) to accelerate the
project. We also abandoned the quarterly announcements
in the absence of interim assemblies
to report.

Although this strategy provided a reason- 
able result very early that was consistent with a 
whole-genome shotgun assembly with eightfold
coverage, the human genome sequence is
not as finished as the Drosophila genome was 
with an effective 13-fold coverage. However, it 
became clear that even with this reduced cov- 
erage strategy, Celera could generate an accu- 
rately ordered and oriented scaffold sequence of 
the human genome in less than 1 year. Human 
genome sequencing was initiated 8 September
1999 and completed 17 June 2000. The first
assembly was completed 25 June 2000, and the
assembly reported here was completed 1 October
2000. Here we describe the whole-genome
random shotgun sequencing effort applied to
the human genome. We developed two different
assembly approaches for assembling the ~3
billion bp that make up the 23 pairs of chromosomes
of the Homo sapiens genome. Any GenBank-derived
data were shredded to remove
potential bias to the final sequence from chimeric
clones, foreign DNA contamination, or
misassembled contigs. Insofar as a correctly
and accurately assembled genome sequence 
with faithful order and orientation of contigs 
is essential for an accurate analysis of the 
human genetic code, we have devoted a con- 
siderable portion of this manuscript to the 
documentation of the quality of our recon- 
struction of the genome. We also describe our 
preliminary analysis of the human genetic 
code on the basis of computational methods.
Figure 1 (see fold-out chart associated with
this issue; files for each chromosome can be
found in Web fig. 1 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1)
provides a graphical overview of the genome and the features encoded
in it. The detailed manual curation and inter- 
pretation of the genome are just beginning. 

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 

1 Sources of DNA and Sequencing Methods 

2 Genome Assembly Strategy and 
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome-Wide Examination of 
Sequence Variations 

7 An Overview of the Predicted Protein- 
Coding Genes in the Human Genome 

8 Conclusions

Summary. This section discusses the rationale
and ethical rules governing donor selection to
ensure ethnic and gender diversity along with
the methodologies for DNA extraction and library
construction. The plasmid library construction
is the first critical step in shotgun
sequencing. If the DNA libraries are not uniform
in size, nonchimeric, and do not randomly
represent the genome, then the subsequent steps
cannot accurately reconstruct the genome sequence.
We used automated high-throughput
DNA sequencing and the computational infrastructure
to enable efficient tracking of enormous
amounts of sequence information (27.3
million sequence reads; 14.9 billion bp of sequence).
Sequencing and tracking from both
ends of plasmid clones from 2-, 10-, and 50-kbp
libraries were essential to the computational
reconstruction of the genome. Our evidence
indicates that the accurate pairing rate of end
sequences was greater than 98%.

Various policies of the United States and the 
World Medical Association, specifically the
Declaration of Helsinki, offer recommendations
for conducting experiments with human
subjects. We convened an Institutional Review
Board (IRB) (31) that helped us establish
the protocol for obtaining and using human
DNA and the informed consent process
used to enroll research volunteers for the 
DNA-sequencing studies reported here. We 
adopted several steps and procedures to pro- 
tect the privacy rights and confidentiality of 
the research subjects (donors). These includ- 
ed a two-stage consent process, a secure ran- 
dom alphanumeric coding system for speci- 
mens and records, circumscribed contact with 
the subjects by researchers, and options for 
off-site contact of donors. In addition, Celera 
applied for and received a Certificate of Con- 
fidentiality from the Department of Health 
and Human Services. This Certificate autho- 
rized Celera to protect the privacy of the 
individuals who volunteered to be donors as 
provided in Section 301(d) of the Public 
Health Service Act 42 U.S.C. 241(d). 

Celera and the IRB believed that the ini- 
tial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds. Prospective
donors were asked, on a voluntary
basis, to self-designate an ethnogeographic
category (e.g., African-American, Chinese,
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32). 

Three basic items of information from 
each donor were recorded and linked by con- 
fidential code to the donated sample: age, 
sex, and self-designated ethnogeographic 
group. From females, ~130 ml of whole,
heparinized blood was collected. From males,
~130 ml of whole, heparinized blood was
collected, as well as five specimens of semen,
collected over a 6-week period. Permanent
lymphoblastoid cell lines were created by
Epstein-Barr virus immortalization. DNA
from five subjects was selected for genomic
DNA sequencing: two males and three females
(one African-American, one Asian-Chinese,
one Hispanic-Mexican, and two Caucasians;
see Web fig. 2 on Science Online
at www.sciencemag.org/cgi/content/291/5507/
1304/DC1). The decision of whose DNA to
sequence was based on a complex mix of factors,
including the goal of achieving diversity as
well as technical issues such as the quality of 
the DNA libraries and availability of immortal- 
ized cell lines. 

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun sequencing
process is preparation of high-quality plasmid
libraries in a variety of insert sizes so that
pairs of sequence reads (mates) are obtained, 
one read from both ends of each plasmid insert. 
High-quality libraries have an equal representa- 
tion of all parts of the genome, a small number 
of clones without inserts, and no contamination 
from such sources as the mitochondrial genome 
and Escherichia coli genomic DNA. DNA from 
each donor was used to construct plasmid libraries
in one or more of three size classes: 2 kbp, 10
kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing pro- 
cess, we focused on developing a simple 
system that could be implemented in a robust
and reproducible manner and monitored effectively
(Fig. 2) (34).

Current sequencing protocols are based on
the dideoxy sequencing method (35), which
typically yields only 500 to 750 bp of sequence 
per reaction. This limitation on read length has 
made monumental gains in throughput a prerequisite
for the analysis of large eukaryotic
genomes. We accomplished this at the Celera
facility, which occupies about 30,000 square 
feet of laboratory space and produces sequence 
data continuously at a rate of 175,000 total 
reads per day. The DNA-sequencing facility is 
supported by a high-performance computation- 
al facility (36). 

The process for DNA sequencing was modular
by design and automated. Intermodule
sample backlogs allowed four principal 
modules to operate independently: (i) li- 
brary transformation, plating, and colony 
picking; (ii) DNA template preparation; 
(iii) dideoxy sequencing reaction set-up 
and purification; and (iv) sequence deter- 
mination with the ABI PRISM 3700 DNA 
Analyzer. Because the inputs and outputs 
of each module have been carefully 
matched and sample backlogs are continu- 
ously managed, sequencing has proceeded 
without a single day's interruption since the 
initiation of the Drosophila project in May 
1999. The ABI 3700 is a fully automated 
capillary array sequencer and as such can 
be operated with a minimal amount of
hands-on time, currently estimated at about
15 min per day. The capillary system also 
facilitates correct associations of sequenc- 
ing traces with samples through the elimi- 
nation of manual sample loading and lane- 
tracking errors associated with slab gels. 

About 65 production staff were hired and
trained, and were rotated on a regular basis
through the four production modules. A
central laboratory information management 
system (LIMS) tracked all sample plates by 
unique bar code identifiers. The facility was 
supported by a quality control team that performed
raw material and in-process testing
and a quality assurance group with responsi- 
bilities including document control, valida- 
tion, and auditing of the facility. Critical to 
the success of the scale-up was the validation 
of all software and instrumentation before 
implementation, and production-scale testing 
of any process changes. 

1.2 Trace processing 

An automated trace-processing pipeline has 
been developed to process each sequence file 
(37). After quality and vector trimming, the
average trimmed sequence length was 543 
bp, and the sequencing accuracy was expo- 
nentially distributed with a mean of 99.5% 
and with less than 1 in 1000 reads being less 
than 98% accurate (26). Each trimmed se- 
quence was screened for matches to contam- 
inants including sequences of vector alone, E. 
coli genomic DNA, and human mitochondri- 
al DNA. The entire read for any sequence 
with a significant match to a contaminant was 
discarded. A total of 713 reads matched E.
coli genomic DNA and 2114 reads matched
the human mitochondrial genome.
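
The contaminant screen described above can be illustrated with a minimal sketch. The text does not say how a "significant match" was scored, so this example substitutes a simple shared k-mer test. The function names, the k-mer length, and the `min_shared` threshold are all hypothetical choices made for illustration, and reverse-complement matches are not handled.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads, contaminants, k=20, min_shared=1):
    """Split reads into (kept, discarded).

    A read is discarded whole if it shares at least `min_shared` k-mers
    with any contaminant sequence. Both thresholds are illustrative
    assumptions, not values taken from the text.
    """
    contaminant_kmers = set()
    for seq in contaminants:
        contaminant_kmers |= kmers(seq, k)

    kept, discarded = [], []
    for read in reads:
        shared = len(kmers(read, k) & contaminant_kmers)
        (discarded if shared >= min_shared else kept).append(read)
    return kept, discarded
```

As in the text, a flagged read is dropped entirely rather than trimmed. A production screen would more plausibly use an alignment tool with an explicit significance cutoff.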

Table 1. Celera-generated data input into assembly.

                                       Number of reads for different insert libraries     Total number of
                          Individual   2 kbp        10 kbp       50 kbp      Total        base pairs
No. of sequencing reads   A            0            0            2,767,357   2,767,357    1,502,674,851
                          B            11,736,757   7,467,755    66,930      19,271,442   10,464,393,006
                          C            853,819      881,290      0           1,735,109    942,164,187
                          D            952,523      1,046,815    0           1,999,338    1,085,640,534
                          F            0            1,498,607    0           1,498,607    813,743,601
                          Total        13,543,099   10,894,467   2,834,287   27,271,853   14,808,616,179
Fold sequence coverage    A            0            0            0.52        0.52
(2.9-Gb genome)           B            2.20         1.40         0.01        3.61
                          C            0.16         0.17         0           0.32
                          D            0.18         0.20         0           0.37
                          F            0            0.28         0           0.28
                          Total        2.54         2.04         0.53        5.11
Fold clone coverage       A            0            0            18.39       18.39
                          B            2.96         11.26        0.44        14.67
                          C            0.22         1.33         0           1.54
                          D            0.24         1.58         0           1.82
                          F            0            2.26         0           2.26
                          Total        3.42         16.43        18.84       38.68
Insert size* (mean)       Average      1,951 bp     10,800 bp    50,715 bp
Insert size* (SD)         Average      6.10%        8.10%        14.90%
% Mates†                  Average      74.50        80.80        75.60

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.

1.3 Quality assessment and control

The importance of the base-pair level accuracy
of the sequence data increases as the
size and repetitive nature of the genome to
be sequenced increases. Each sequence
read must be placed uniquely in the genome,
and even a modest error rate can
reduce the effectiveness of assembly. In 
addition, maintaining the validity of mate- 
pair information is absolutely critical for 
the algorithms described below. Procedural 
controls were established for maintaining 
the validity of sequence mate-pairs as se- 
quencing reactions proceeded through the 
process, including strict rules built into the 
LIMS. The accuracy of sequence data pro- 
duced by the Celera process was validated 
in the course of the Drosophila genome 
project (26). By collecting data for the
entire human genome in a single facility,
we were able to ensure uniform quality
standards and the cost advantages associat- 
ed with automation, an economy of scale, 
and process consistency. 

2 Genome Assembly Strategy and 
Characterization 

Summary. We describe in this section the two
approaches that we used to assemble the genome.
One method involves the computational
combination of all sequence reads with shredded
data from GenBank to generate an independent,
nonbiased view of the genome. The second
approach involves clustering all of the fragments
to a region or chromosome on the basis
of mapping information. The clustered data
were then shredded and subjected to computational
assembly. Both approaches provided essentially
the same reconstruction of assembled
DNA sequence with proper order and orientation.
The second method provided slightly
greater sequence coverage (fewer gaps) and
was the principal sequence used for the analysis
phase. In addition, we document the completeness
and correctness of this assembly process
and provide a comparison to the public genome
sequence, which was reconstructed largely by
an independent BAC-by-BAC approach. Our
assemblies effectively covered the euchromatic
regions of the human chromosomes. More than
90% of the genome was in scaffold assemblies
of 100,000 bp or greater, and 25% of the genome
was in scaffolds of 10 million bp or
larger.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received,
selected, and processed in compliance with standard operating procedures,
with a focus on quality within and across departments. Each
process has defined inputs and outputs with the capability to exchange
samples and data with both internal and external entities according to
defined quality guidelines. Manufacturing pipeline processes, products,
quality control measures, and responsible parties are indicated and are
described further in the text.

Shotgun sequence assembly is a classic 
example of an inverse problem: given a set 
of reads randomly sampled from a target 
sequence, reconstruct the order and the po- 
sition of those reads in the target. Genome 
assembly algorithms developed for Drosophila
have now been extended to assemble
the ~25-fold larger human genome. Celera assemblies
consist of a set of contigs that are
ordered and oriented into scaffolds that are then 
mapped to chromosomal locations by using 
known markers. The contigs consist of a col- 
lection of overlapping sequence reads that pro- 
vide a consensus reconstruction for a contigu- 
ous interval of the genome. Mate pairs are a 
central component of the assembly strategy. 
They are used to produce scaffolds in which the 
size of gaps between consecutive contigs is 
known with reasonable precision. This is ac- 
complished by observing that a pair of reads, 
one of which is in one contig, and the other of 
which is in another, implies an orientation and 
distance between the two contigs (Fig. 3). Finally,
our assemblies did not incorporate all
reads into the final set of reported scaffolds.
This set of unincorporated reads is termed
"chaff," and typically consisted of reads from 
within highly repetitive regions, data from other 
organisms introduced through various routes as 
found in many genome projects, and data of 
poor quality or with untrimmed vector. 
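
The way a mate pair implies a distance between two contigs can be sketched as follows. This is a minimal illustration, not the Celera assembler's algorithm. It assumes the forward read lies in contig A, the reverse read lies in contig B, both contigs are already in a consistent orientation, and the library's mean insert size is known. All function and parameter names are hypothetical.

```python
from statistics import mean


def gap_from_mate_pair(contig_a_len, fwd_start_in_a, rev_end_in_b, insert_mean):
    """Estimate the gap between contig A and contig B from one mate pair.

    fwd_start_in_a: 0-based start of the forward read in contig A.
    rev_end_in_b:   end coordinate (exclusive) of the reverse read in contig B.
    The clone insert spans from fwd_start_in_a in A to rev_end_in_b in B, so
    insert = (contig_a_len - fwd_start_in_a) + gap + rev_end_in_b.
    """
    return insert_mean - (contig_a_len - fwd_start_in_a) - rev_end_in_b


def estimate_gap(contig_a_len, mate_pairs, insert_mean):
    """Average the per-pair estimates over all mate pairs linking A to B."""
    return mean(
        gap_from_mate_pair(contig_a_len, a_start, b_end, insert_mean)
        for a_start, b_end in mate_pairs
    )
```

Averaging several linking pairs reduces the effect of insert-size variation in any one clone. A negative estimate would suggest that the two contigs overlap rather than being separated by a gap.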
2.1 Assembly data sets

We used two independent sets of data for our 
assemblies. The first was a random shotgun 
data set of 27.27 million reads of average length 
543 bp produced at Celera. This consisted 
largely of mate-pair reads from 16 libraries 
constructed from DNA samples taken from five 
different donors. Libraries with insert sizes of 2, 
10, and 50 kbp were used. By looking at how 
mate pairs from a library were positioned in 
known sequenced stretches of the genome, we 
were able to characterize the range of insert 
sizes in each library and determine a mean and
standard deviation. Table 1 details the number 
of reads, sequencing coverage, and clone coverage
achieved by the data set. The clone coverage
is the coverage of the genome in cloned
DNA, considering the entire insert of each 
clone that has sequence from both ends. The 
clone coverage provides a measure of the 
amount of physical DNA coverage of the ge- 
nome. Assuming a genome size of 2.9 Gbp, the 
Celera trimmed sequences gave a 5.1X coverage
of the genome, and clone coverage was
3.42X, 16.40X, and 18.84X for the 2-, 10-, and 
50-kbp libraries, respectively, for a total of 
38.7X clone coverage. 
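
The coverage figures in this paragraph can be approximated from the Table 1 totals. The sketch below is illustrative only. The clone-coverage formula assumes that "% Mates" is the fraction of reads belonging to a clone sequenced from both ends, and that each such clone contributes its mean insert length. That interpretation is an assumption, so it yields values near, but not identical to, the published ones. Function names are hypothetical.

```python
GENOME_SIZE_BP = 2.9e9  # genome size assumed in the text


def sequence_coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Fold sequence coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def clone_coverage(n_reads, mated_fraction, mean_insert_bp,
                   genome_size=GENOME_SIZE_BP):
    """Approximate fold clone coverage for one library size class.

    Two mated reads define one clone, and each clone is counted at its
    mean insert length (an assumed simplification).
    """
    n_clones = n_reads * mated_fraction / 2
    return n_clones * mean_insert_bp / genome_size


# Table 1 totals: all reads, then the 2-kbp library class.
seq_cov = sequence_coverage(14_808_616_179)
clone_cov_2kbp = clone_coverage(13_543_099, 0.745, 1_951)
```

With the Table 1 totals, `seq_cov` rounds to 5.11, matching the reported sequence coverage, while `clone_cov_2kbp` comes out close to the reported 3.42X.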

The second data set was from the publicly 
funded Human Genome Project (PFP) and is 
primarily derived from BAC clones (30). The 
BAC data input to the assemblies came from a 
download of GenBank on 1 September 2000 
(Table 2) totaling 4443.3 Mbp of sequence. 
The data for each BAC is deposited at one of 
four levels of completion. Phase 0 data are a set 
of generally unassembled sequencing reads 
from a very light shotgun of the BAC, typically 
less than 1X. Phase 1 data are unordered assemblies
of contigs, which we call BAC contigs
or bactigs. Phase 2 data are ordered assemblies 
of bactigs. Phase 3 data are complete BAC
sequences. In the past 2 years the PFP has
focused on a product of lower quality and completeness,
but on a faster time-course, by concentrating
on the production of Phase 1 data
from a 3X to 4X light-shotgun of each BAC
clone.

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and
internally derived reads from five different individuals (black lines) are combined to produce a
contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using
mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star)
physical map information.

We screened the bactig sequences for con- 
taminants by using the BLAST algorithm 
against three data sets: (i) vector sequences 
in Univec core (38), filtered for a 25-bp
match at 98% sequence identity at the ends 
of the sequence and a 30-bp match internal 
to the sequence; (ii) the nonhuman portion
of the High Throughput Genomic (HTG)
Sequences division of GenBank (39), filtered
at 200 bp at 98%; and (iii) the nonredundant
nucleotide sequences from GenBank
without primate and human virus entries,
filtered at 200 bp at 98%. Whenever
25 bp or more of vector was found within 
50 bp of the end of a contig, the tip up to 
the matching vector was excised. Under 
these criteria we removed 2.6 Mbp of pos- 
sible contaminant and vector from the 
Phase 3 data, 61.0 Mbp from the Phase 1 
and 2 data, and 16.1 Mbp from the Phase 0 
data (Table 2). This left us with a total of 
4363.7 Mbp of PFP sequence data: 20%
finished, 75% rough-draft (Phase 1 and 2), 
and 5% single sequencing reads (Phase 0). 
An additional 104,018 BAC end-sequence 
mate pairs were also downloaded and included
in the data sets for both assembly
processes (18). 
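
The tip-excision rule stated above (25 bp or more of vector within 50 bp of a contig end) can be sketched as a small function. This is an illustrative reading of the rule, not the authors' code. It assumes that vector hits are supplied as half-open (start, end) intervals, for example parsed from a BLAST report, and that the excised tip includes the vector match itself. The function name and parameters are hypothetical.

```python
def trim_vector_tips(contig, vector_hits, min_match=25, end_window=50):
    """Excise contig tips that carry vector near an end.

    vector_hits: list of (start, end) half-open intervals where vector
    sequence was matched in the contig.
    Returns the trimmed contig string.
    """
    left, right = 0, len(contig)
    for start, end in vector_hits:
        if end - start < min_match:
            continue  # match too short to act on
        if start < end_window:
            left = max(left, end)      # cut the leading tip through the match
        elif len(contig) - end < end_window:
            right = min(right, start)  # cut the trailing tip from the match on
    return contig[left:right] if left < right else ""
```

Hits that are internal to the contig are left alone here. The text handles internal matches with a separate filter, which this sketch does not model.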

2.2 Assembly strategies 

Two different approaches to assembly were 
pursued. The first was a whole-genome as- 
sembly process that used Celera data and the 
PFP data in the form of additional synthetic 
shotgun data, and the second was a compart- 
mentalized assembly process that first parti- 
tioned the Celera and PFP data into sets 
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on 
each set. Figure 4 gives a schematic of the 
overall process flow. 

For the whole-genome assembly, the PFP data was first disassembled or "shredded" into a
synthetic shotgun data set of 550-bp reads that form a perfect 2X covering of the bactigs. This
resulted in 16.05 million "faux" reads that were sufficient to cover the genome 2.96X because of
redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly
process. The combined data set of 43.32 million reads (8X), and all associated mate-pair
information, were then subjected to our whole-genome assembly algorithm to produce a
reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of
bactigs was used in this process. Bactigs were shredded into reads because we found strong
evidence that 2.13% of them were misassembled (40). Furthermore, BAC location information was
ignored because some BACs were not correctly placed on the PFP physical map and because we
found strong evidence that at least 2.2% of the BACs contained sequence data that were not part
of the given BAC (41), possibly as a result of sample-tracking errors

Table 2. GenBank data input into assembly.



Completion phase sequence 



Center 


Statistics 


0 


1 and 2 


3 


Whitehead Institute/ 


Number of accession records 


2,825 


6,533 


363 


MIT Center for 


Number of con tigs 


243,786 


138,023 


363 


Genome Research, 


Total base pairs 


194,490,158 


1,083,848,245 


48.829,358 


USA 


Total vector masked (bp) 
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4,417,055 


98,028 




(bp) 










Average contig length (bp) 


798 


7,853 
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Washington University, 


Number of accession records 


19 
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USA 


Number of contigs 
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Total base pairs 
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(bp) 










Average contig length (bp) 
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9.079 
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Baylor College of 


• Number of accession records 


0 


1,626 


363 


Medicine, USA 


Number of contigs 


0 


44,861 


363 




Total base pairs 


0 
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49,017,104 




Total vector masked (bp) 


0 


218,769 
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Total contaminant masked 


o 
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(bp) 










Average contig length (bp) 


0 


5,919 
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Production Sequencing 


Number of accession records 


135 


2,043 
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Facility, DOE Joint 


Number of contigs 


7,052 


34.938 


754 


Genome Institute, 


Total base pairs 


8,680,214 


294,249,631 
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USA 


Total vector masked (bp) 


. 22,644 


162,651 
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(bp) 
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Number of accession records 
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1,149 
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o 
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Total base pairs 


o 
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20,093,926 


Japan 


Total vector masked (bp) 


0 
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Total contaminant masked (bp) 


0 
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Average contig length (bp) 


0



Statistics                       Completion phase sequence
                                               0         1 and 2               3
Number of accession records                3,021          21,015           9,137
Number of contigs                        258,943         409,628           9,137
Total base pairs                     209,930,983   3,360,047,574     835,722,268
Total vector masked (bp)               1,655,293       2,438,575          82,284
Total contaminant masked (bp)         14,918,135      16,311,664       3,365,230
Average contig length (bp)                   811           8,203          91,466




*Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center;
Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE;
Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence
Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer
Molekulare Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic
Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas
Southwestern Medical Center; University of Washington. †The 4,405,700,825 bases contributed by all centers were
shredded into faux reads resulting in 2.96X coverage of the genome.



(see below). In short, we performed a true, ab initio whole-genome assembly in which we took
the expedient of deriving additional sequence coverage, but not mate pairs, assembled bactigs,
or genome locality, from some externally generated data.
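The shredding step and the coverage figures quoted for it can be sketched as follows. The function names are hypothetical, the tiling rule (a new read every 275 bp, with a final read placed flush against the end) is one assumed way to approximate a 2X covering, and the 2.98-Gbp genome length in the usage note is back-computed from the quoted numbers, not stated in the text.

```python
def shred(bactig, read_len=550, coverage=2):
    """Cut a bactig into overlapping faux reads that tile it `coverage`-fold.

    Assumption: reads start every read_len // coverage bases, so interior
    bases are covered twice while the two extreme tips are covered once.
    """
    if len(bactig) <= read_len:
        return [bactig]
    step = read_len // coverage
    last_start = len(bactig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)          # final read flush with the end
    return [bactig[s:s + read_len] for s in starts]


def fold_coverage(n_reads, read_len, genome_bp):
    """Average depth of coverage implied by a set of equal-length reads."""
    return n_reads * read_len / genome_bp
```

With these assumptions, `fold_coverage(16.05e6, 550, 2.98e9)` comes out close to the 2.96X quoted for the faux reads, and 27.27 million Celera reads plus 16.05 million faux reads gives the 43.32 million combined reads.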

In the compartmentalized shotgun assembly (CSA), Celera and PFP data were partitioned into
the largest possible chromosomal segments or "components" that could be determined with
confidence, and then shotgun assembly was applied to each partitioned subset wherein the bactig
data were again shredded into faux reads to ensure an independent ab initio assembly of the
component. By subsetting the data in this way, the overall computational effort was reduced and
the effect of interchromosomal duplications was ameliorated. This also resulted in a
reconstruction of the genome that was relatively independent of the whole-genome assembly
results so that the two assemblies could be compared for consistency. The quality of the
partitioning into components was crucial so that different genome regions were not mixed
together. We constructed components from (i) the longest scaffolds of the sequence from each
BAC and (ii) assembled scaffolds of data unique to Celera's data set. The BAC assemblies were
obtained by a combining assembler that used the bactigs and the 5X Celera data mapped to those
bactigs as input. This effort was undertaken as an interim step solely because the more accurate
and complete the scaffold for a given sequence stretch, the more accurately one can tile these
scaffolds into contiguous components on the basis of sequence overlap and mate-pair
information. We further visually inspected and curated the scaffold tiling of the components to
further increase its accuracy. For the final CSA assembly, all but the partitioning was ignored,
and an independent, ab initio reconstruction of the sequence in each component was obtained by
applying our whole-genome assembly algorithm to the partitioned, relevant Celera data and the
shredded, faux reads of the partitioned, relevant bactig data.

2.3 Whole-genome assembly 

The algorithms used for whole-genome assembly (WGA) of the human genome were enhancements
to those used to produce the sequence of the Drosophila genome reported in detail in (28).

The WGA assembler consists of a pipeline composed of five principal stages: Screener,
Overlapper, Unitigger, Scaffolder, and Repeat Resolver, respectively. The Screener finds and
marks all microsatellite repeats with less than a 6-bp element, and screens out all known
interspersed repeat elements, including Alu, Line, and ribosomal DNA. Marked regions get
searched for overlaps, whereas screened regions do not get searched, but can be part of an
overlap that involves unscreened matching segments.

The Overlapper compares every read against every other read in search of complete end-to-end
overlaps of at least 40 bp and with no more than 6% differences in the match. Because all data
are scrupulously vector-trimmed, the Overlapper can insist on complete overlap matches.
Computing the set of all overlaps took roughly 10,000 CPU hours with a suite of four-processor
Alpha SMPs with 4 gigabytes of RAM. This took 4 to 5 days in elapsed time with 40 such machines
operating in parallel.
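The overlap criterion (at least 40 bp, no more than 6% differences, running to the end of both reads) can be illustrated with a deliberately simplified check. The function name is hypothetical, and counting substitutions only is an assumption made to keep the sketch short; the text does not say how differences were scored.

```python
def end_overlap(a, b, min_len=40, max_diff=0.06):
    """Longest suffix-of-a / prefix-of-b overlap that meets both thresholds.

    Returns the overlap length in bases, or 0 when no overlap qualifies.
    Simplification: only mismatched positions are counted as differences.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length
    return 0
```

This quadratic scan is only meant to make the thresholds concrete; an all-against-all comparison of tens of millions of reads needs indexing and parallelism, as the run times quoted above suggest.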

Every overlap computed above is statistically a 1-in-10^17 event and thus not a coincidental
event. What makes assembly combinatorially difficult is that while many overlaps are actually
sampled from overlapping regions of the genome, and thus imply that the sequence reads should
be assembled together, even more overlaps are actually from two distinct copies of a low-copy
repeated element not screened above, thus constituting an error if put together. We call the
former "true overlaps" and the latter "repeat-induced overlaps." The assembler must avoid
choosing repeat-induced overlaps, especially early in the process.

We achieve this objective in the Unitigger. We first find all assemblies of reads that appear
to be uncontested with respect to all other reads. We call the contigs formed from these
subassemblies unitigs (for uniquely assembled contigs). Formally, these unitigs are uncontested
interval subgraphs of the graph of all overlaps (42). Unfortunately, although empirically many
of these assemblies are correct (and thus involve only true overlaps), some are in fact
collections of reads from several copies of a repetitive element that have been overcollapsed
into a single subassembly. However, the overcollapsed unitigs are easily identified because
their average coverage depth is too high to be consistent with the overall level of sequence
coverage. We developed a simple statistical discriminator that gives the logarithm of the odds
ratio that a unitig is composed of unique DNA or of a repeat consisting of two or more copies.
The discriminator, set to a sufficiently stringent threshold, identifies a subset of the unitigs
that we are certain are correct. In addition, a second, less stringent threshold identifies a
subset of remaining unitigs very likely to be correctly assembled, of which we select those that
will consistently scaffold (see below), and thus are again almost certain to be correct. We call
the union of these two sets U-unitigs. Empirically, we found from a 6X simulated shotgun of
human chromosome 22 that we get U-unitigs covering 98% of the stretches of unique DNA that are
>2 kbp long. We are further able to identify the boundary of the start of a repetitive element
at the ends of a U-unitig and leverage this so that U-unitigs span more than 93% of all singly
interspersed Alu elements and other 100- to 400-bp repetitive segments.
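The text does not give the formula for the discriminator. One way such a log-odds score could be built, assuming reads arrive along the genome as a Poisson process, is sketched below; the function name, the natural-log scale, and the two-copy alternative are all assumptions of this sketch.

```python
import math


def unique_log_odds(n_reads, unitig_len, total_reads, genome_len):
    """ln[ P(n_reads | unique DNA) / P(n_reads | two-copy repeat) ].

    Under a Poisson model the expected read count in unique DNA is
    m = unitig_len * total_reads / genome_len, and it is 2m for a
    collapsed two-copy repeat. The ratio Poisson(k; m) / Poisson(k; 2m)
    simplifies to exp(m) / 2**k, whose logarithm is m - k * ln 2.
    """
    expected = unitig_len * total_reads / genome_len
    return expected - n_reads * math.log(2)
```

A unitig whose read count is near the expected value scores positive, while one holding about twice as many reads scores negative, which is the behaviour described for overcollapsed unitigs. The thresholds that define U-unitigs are not given in the text and are not reproduced here.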

The result of running the Unitigger was thus a set of correctly assembled subcontigs covering
an estimated 73.6% of the human genome. The Scaffolder then proceeded to use mate-pair
information to link these together into scaffolds. When there are two or more mate pairs that
imply that a given pair of U-unitigs are at a certain distance and orientation with respect to
each other, the probability of this being wrong is again roughly 1 in 10^10, assuming that mate
pairs are false less than 2% of the time. Thus, one can with high confidence link together all
U-unitigs that are linked by at least two 2- or 10-kbp mate pairs producing intermediate-sized
scaffolds that are then recursively linked together by confirming 50-kbp mate pairs and BAC end
sequences. This process yielded scaffolds that are on the order of megabase pairs in size with
gaps between their contigs that generally correspond to repetitive elements and occasionally to
small sequencing gaps. These scaffolds reconstruct the majority of the unique sequence within a
genome.
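The rule of linking only those U-unitigs joined by two or more mate pairs can be sketched as a simple support count. The input format and function name are hypothetical, and the sketch does not check that the mates agree on distance and orientation, which the text requires.

```python
from collections import Counter


def confirmed_links(mate_pairs, min_support=2):
    """Keep unitig pairs that are joined by at least `min_support` mate pairs.

    mate_pairs: iterable of (unitig_a, unitig_b), one entry per mate pair
    whose two reads fall in different U-unitigs (assumed input format).
    """
    support = Counter(tuple(sorted(pair)) for pair in mate_pairs)
    return {pair: n for pair, n in support.items() if n >= min_support}
```

For example, `confirmed_links([("u1", "u2"), ("u2", "u1"), ("u2", "u3")])` keeps only the link between `u1` and `u2`, because the `u2`/`u3` link rests on a single mate pair.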

For the Drosophila assembly, we engaged in a three-stage repeat resolution strategy where each
stage was progressively more aggressive and thus more likely to make a mistake. For the human
assembly, we continued to use the first "Rocks" substage where all unitigs with a good, but not
definitive, discriminator score are placed in a scaffold gap. This was done with the condition
that two or more mate pairs with one of their reads already in the scaffold unambiguously place
the unitig in the given gap. We estimate the probability of inserting a unitig into an incorrect
gap with this strategy to be less than 10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage of the human assembly, making it more like the
mechanism suggested in our earlier work (43). For each gap, every read R that is placed in the
gap by virtue of its mated pair M being in a contig of the scaffold and implying R's placement
is collected. Celera's mate-pairing information is correct more than 99% of the time. Thus,
almost every, but not all, of the reads in the set belong in the gap, and when a read does not
belong it rarely agrees with the remainder of the reads. Therefore, we simply assemble this set
of reads within the gap, eliminating any reads that conflict with the assembly. This operation
proved much more reliable than the one it replaced for the Drosophila assembly; in the assembly
of a simulated shotgun data set of human chromosome 22, all stones were placed correctly.

[Figure 4 diagram labels: 5.11X Celera reads; 39X mate pairs; Public bactigs (from 33,421 BACs);
Bactigs & Celera pairs (binned by BAC); Components 1, Components 2, ..., Components K;
WGA Assembly; CSA Assembly]

Fig. 4. Architecture of Celera's two-pronged assembly strategy. Each oval denotes a computation
process performing the function indicated by its label, with the labels on arcs between ovals
describing the nature of the objects produced and/or consumed by a process. This figure
summarizes the discussion in the text that defines the terms and phrases used.

The final method of resolving gaps is to 
fill them with assembled BAC data that cover 
the gap. We call this external gap "walking." 
We did not include the very aggressive "Peb- 
bles" substage described in our Drosophila 
work, which made enough mistakes so as to 
produce repeat reconstructions for long inter- 
spersed elements whose quality was only 
99.62% correct. We decided that for the hu- 
man genome it was philosophically better not 
to introduce a step that was certain to produce 
less than 99.99% accuracy. The cost was a 
somewhat larger number of gaps of some- 
what larger size. 

At the final stage of the assembly process, and also at several intermediate points, a
consensus sequence of every contig is produced. Our algorithm is driven by the principle of
maximum parsimony, with quality-value-weighted measures for evaluating each base. The net
effect is a Bayesian estimate of the correct base to report at each position. Consensus
generation uses Celera data whenever it is present. In the event that no Celera data cover a
given region, the BAC data sequence is used.
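The preference for Celera data and the use of quality values can be illustrated with a much simpler stand-in for the consensus algorithm. The sketch below is a quality-weighted vote, not the Bayesian estimate described above, and the column layout and function name are hypothetical.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick the base with the largest summed quality in one alignment column.

    column: list of (base, quality, source) tuples, where source is
    "celera" or "bac" (assumed layout). Celera reads are used whenever
    any are present; BAC-derived bases are used only as a fallback.
    """
    celera = [entry for entry in column if entry[2] == "celera"]
    votes = defaultdict(int)
    for base, quality, _source in (celera or column):
        votes[base] += quality
    return max(votes, key=votes.get)
```

The fallback mirrors the last two sentences of the paragraph: BAC data decide the base only where no Celera read covers the position.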

A key element of achieving a WGA of the human genome was to parallelize the Overlapper and
the central consensus sequence-constructing subroutines. In addition, memory was a real issue:
a straightforward application of the software we had built for Drosophila would have required a
computer with a 600-gigabyte RAM. By making the Overlapper and Unitigger incremental, we were
able to achieve the same computation with a maximum of instantaneous usage of 28 gigabytes of
RAM. Moreover, the incremental nature of the first three stages allowed us to continually update
the state of this part of the computation as data were delivered and then perform a 7-day run
to complete Scaffolding and Repeat Resolution whenever desired. For our assembly operations,
the total compute infrastructure consists of 10 four-processor SMPs with 4 gigabytes of memory
per cluster (Compaq's ES40, Regatta) and a 16-processor NUMA machine with 64 gigabytes of
memory (Compaq's GS160, Wildfire). The total compute for a run of the assembler was roughly
20,000 CPU hours.

The assembly of Celera's data, together with the shredded bactig data, produced a set of
scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence. The chaff, or set
of reads not incorporated in the assembly, numbered 11.27 million (26%), which is consistent
with our experience for Drosophila. More than 84% of the genome was covered by scaffolds
>100 kbp long, and these averaged 91% sequence and 9% gaps with a total of 2.297 Gbp of
sequence. There were a total of 93,857 gaps among the 1637 scaffolds >100 kbp. The average
scaffold size was 1.5 Mbp, the average contig size was 24.06 kbp, and the average gap size was
2.43 kbp, where the distribution of each was essentially exponential. More than 50% of all gaps
were less than 500 bp long, >62% of all gaps were less than 1 kbp long, and no gap was >100 kbp
long. Similarly, more than 65% of the sequence is in contigs >30 kbp, more than 31% is in
contigs >100 kbp, and the largest contig was 1.22 Mbp long. Table 3 gives detailed summary
statistics for the structure of this assembly with a direct comparison to the compartmentalized
shotgun assembly.
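The summary figures in this paragraph follow from four Table 3 entries for WGA scaffolds >100 kbp, as the short check below shows. The variable names are illustrative; the only assumption is that each scaffold with n contigs contains n - 1 gaps.

```python
# Table 3, whole-genome assembly, scaffolds >100 kbp
span_bp = 2_525_334_447      # bp in scaffolds, including intrascaffold gaps
seq_bp = 2_297_678_935       # bp in contigs
scaffolds = 1_637
contigs = 95_494

gaps = contigs - scaffolds                 # one fewer gap than contigs per scaffold
avg_scaffold = span_bp / scaffolds         # about 1.5 Mbp
avg_contig = seq_bp / contigs              # about 24.06 kbp
avg_gap = (span_bp - seq_bp) / gaps        # about 2.43 kbp
seq_fraction = seq_bp / span_bp            # about 91% sequence, 9% gaps
```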

2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we pursued a localized assembly approach that was intended
to subdivide the genome into segments, each of which could be shotgun assembled individually.
We expected that this would help in resolution of large interchromosomal duplications and
improve the statistics for calculating U-unitigs. The compartmentalized assembly process
involved clustering Celera reads and bactigs into large, multiple megabase regions of the
genome, and then running the WGA assembler on the Celera data and shredded, faux reads obtained
from the bactig data.

The first phase of the CSA strategy was to separate Celera reads into those that matched the
BAC contigs for a particular PFP BAC entry, and those that did not match any public data. Such
matches must be guaranteed to


Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

                                                                                   Scaffold size
                                                                   All        >30 kbp       >100 kbp       >500 kbp      >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds (including intrascaffold gaps)    2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
No. of bp in contigs                                     2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                                                53,591          2,845          1,935          1,060            721
No. of contigs                                                 170,033        112,207        107,199         93,138         82,009
No. of gaps                                                    116,442        109,362        105,264         92,078         81,288
No. of gaps ≤1 kbp                                              72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)                                      54,217        966,219      1,395,602      2,348,450      3,118,848
Average contig size (bp)                                        15,609         22,496         23,242         24,916         25,686
Average intrascaffold gap size (bp)                              2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                                          1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                                                 100             95             94             87             79

Whole-genome assembly
No. of bp in scaffolds (including intrascaffold gaps)    2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
No. of bp in contigs                                     2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                                               118,968          2,507          1,637            818            554
No. of contigs                                                 221,036         99,189         95,494         84,641         76,285
No. of gaps                                                    102,068         96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp                                              62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)                                      23,938      1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)                                        11,702         23,534         24,061         25,319         25,999
Average intrascaffold gap size (bp)                              2,560          2,487          2,426          2,213          2,082
Largest contig (bp)                                          1,224,073      1,224,073      1,224,073      1,224,073      1,224,073
% of total contigs                                                 100             90             89             83             77

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properly place a Celera read, so all reads were 
first masked against a library of common repetitive elements, and only matches of at least
40 bp to unmasked portions of the read constituted a hit. Of Celera's 27.27 million reads,
20.76 million matched a bactig and another 0.62 million reads, which did not have any matches,
were nonetheless identified as belonging in the region of the bactig's BAC because their mate
matched the bactig. Of the remaining reads, 2.92 million were completely screened out and so
could not be matched, but the other 2.97 million reads had unmasked sequence totaling 1.189 Gbp
that were not found in the GenBank data set. Because the Celera data are 5.11X redundant, we
estimate that 240 Mbp of unique Celera sequence is not in the GenBank data set.
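The read accounting in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative. Dividing the unmatched sequence by the redundancy gives roughly 233 Mbp, close to the 240 Mbp quoted; the text does not say what further adjustment, if any, the authors applied.

```python
# Millions of Celera reads, as quoted in the text
matched = 20.76            # matched a bactig directly
placed_by_mate = 0.62      # no match, but the mate matched a bactig
fully_screened = 2.92      # entirely masked as repeat, so unmatchable
unmatched = 2.97           # unmasked sequence absent from GenBank

total_reads = matched + placed_by_mate + fully_screened + unmatched
not_in_genbank = fully_screened + unmatched

# 1.189 Gbp of unmatched sequence at 5.11X redundancy, in Mbp
unique_mbp = 1189 / 5.11
```

The four categories sum to the 27.27 million Celera reads, and the last two sum to the 5.89 million fragments that are assembled separately later in the CSA process.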

In the next step of the CSA process, a combining assembler took the relevant 5X Celera reads
and bactigs for a BAC entry, and produced an assembly of the combined data for that locale.
These high-quality sequence reconstructions were a transient result whose utility was simply to
provide more reliable information for the purposes of their tiling into sets of overlapping and
adjacent scaffold sequences in the next step. In outline, the combining assembler first examines
the set of matching Celera reads to determine if there are excessive pileups indicative of
unscreened repetitive elements. Wherever these occur, reads in the repeat region whose mates
have not been mapped to consistent positions are removed. Then all sets of mate pairs that
consistently imply the same relative position of two bactigs are bundled into a link and
weighted according to the number of mates in the bundle. A "greedy" strategy then attempts to
order the bactigs by selecting bundles of mate-pairs in order of their weight. A selected
mate-pair bundle can tie together two formative scaffolds. It is incorporated to form a single
scaffold only if it is consistent with the majority of links between contigs of the scaffold.
Once scaffolding is complete, gaps are filled by the "Stones" strategy described above for the
WGA assembler.
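The greedy ordering step can be sketched with a union-find structure that tracks relative positions. This is an illustration only: the input format, the 1-kbp tolerance, and the function name are hypothetical, orientation is ignored, and the majority-of-links test from the text is replaced by a simpler rule in which a lighter bundle is rejected when it contradicts positions already fixed by heavier bundles.

```python
def greedy_scaffold(bundles, tolerance=1000):
    """Accept mate-pair bundles in order of decreasing weight.

    bundles: list of (a, b, offset, weight), read as "bactig b starts
    about `offset` bp after bactig a, supported by `weight` mate pairs"
    (assumed format). Returns (accepted, rejected) lists of (a, b).
    """
    parent, shift = {}, {}     # shift[x] = pos[x] - pos[parent[x]]

    def find(x):
        """Return (root of x, pos[x] - pos[root]), compressing the path."""
        parent.setdefault(x, x)
        shift.setdefault(x, 0)
        if parent[x] == x:
            return x, 0
        root, parent_offset = find(parent[x])
        parent[x] = root
        shift[x] += parent_offset
        return root, shift[x]

    accepted, rejected = [], []
    for a, b, offset, _weight in sorted(bundles, key=lambda t: -t[3]):
        root_a, pos_a = find(a)
        root_b, pos_b = find(b)
        if root_a != root_b:
            # Tie two formative scaffolds together so pos[b] - pos[a] == offset
            parent[root_b] = root_a
            shift[root_b] = pos_a + offset - pos_b
            accepted.append((a, b))
        elif abs((pos_b - pos_a) - offset) <= tolerance:
            accepted.append((a, b))        # agrees with earlier, heavier links
        else:
            rejected.append((a, b))        # contradicts the current layout
    return accepted, rejected
```

With bundles A to B (10 kbp, weight 5), B to C (8 kbp, weight 4) and A to C (40 kbp, weight 1), the first two are accepted and the third is rejected, because the heavier links already place C about 18 kbp after A.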

The GenBank data for the Phase 1 and 2 BACs consisted of an average of 19.8 bactigs per BAC of
average size 8099 bp. Application of the combining assembler resulted in individual Celera BAC
assemblies being put together into an average of 1.83 scaffolds (median of 1 scaffold)
consisting of an average of 8.57 contigs of average size 18,973 bp. In addition to defining
order and orientation of the sequence fragments, there were 57% fewer gaps in the combined
result. For Phase 0 data, the average GenBank entry consisted of 91.52 reads of average length
784 bp. Application of the combining assembler resulted in an average of 54.8 scaffolds
consisting of an average of 58.1 contigs of average size 873 bp. Basically, some small amount of
assembly took place, but not enough Celera data were matched to truly assemble the 0.5X to 1X
data set represented by the typical Phase 0 BACs. The combining assembler was also applied to
the Phase 3 BACs for SNP identification, confirmation of assembly, and localization of the
Celera reads. The Phase 0 data suggest that a combined whole-genome shotgun data set and 1X
light-shotgun of BACs will not yield good assembly of BAC regions; at least 3X light-shotgun of
each BAC is needed.

The 5.89 million Celera fragments not matching the GenBank data were assembled with our
whole-genome assembler. The assembly resulted in a set of scaffolds totaling 442 Mbp in span and
consisting of 326 Mbp of sequence. More than 20% of the scaffolds were >5 kbp long, and these
averaged 63% sequence and 27% gaps with a total of 302 Mbp of sequence. All scaffolds >5 kbp
were forwarded along with all scaffolds produced by the combining assembler to the subsequent
tiling phase.

At this stage, we typically had one or two scaffolds for every BAC region constituting at
least 95% of the relevant sequence, and a collection of disjoint Celera-unique scaffolds. The
next step in developing the genome components was to determine the order and overlap tiling of
these BAC and Celera-unique scaffolds across the genome. For this, we used Celera's 50-kbp
mate-pairs information, and BAC-end pairs (18) and sequence tagged site (STS) markers (44) to
provide long-range guidance and chromosome separation. Given the relatively manageable number
of scaffolds, we chose not to produce this tiling in a fully automated manner, but to compute an
initial tiling with a good heuristic and then use human curators to resolve discrepancies or
missed join opportunities. To this end, we developed a graphical user interface that displayed
the graph of tiling overlaps and the evidence for each. A human curator could then explore the
implication of mapped STS data, dot-plots of sequence overlap, and a visual display of the
mate-pair evidence supporting a given choice. The result of this process was a collection of
"components," where each component was a tiled set of BAC and Celera-unique scaffolds that had
been curator-approved. The process resulted in 3845 components with an estimated span of
2.922 Gbp.

In order to generate the final CSA, we assembled each component with the WGA algorithm. As
was done in the WGA process, the bactig data were shredded into a synthetic 2X shotgun data set
in order to give the assembler the freedom to independently assemble the data. By using faux
reads rather than bactigs, the assembly algorithm could correct errors in the assembly of
bactigs and remove chimeric content in a PFP data entry. Chimeric or contaminating sequence
(from another part of the genome) would not be incorporated into the reassembly of the
component because it did not belong there. In effect, the previous steps in the CSA process
served only to bring together Celera fragments and PFP data relevant to a large contiguous
segment of the genome, wherein we applied the assembler used for WGA to produce an ab initio
assembly of the region.

WGA assembly of the components resulted in a set of scaffolds totaling 2.906 Gbp in span and
consisting of 2.654 Gbp of sequence. The chaff, or set of reads not incorporated into the
assembly, numbered 6.17 million, or 22%. More than 90.0% of the genome was covered by scaffolds
spanning >100 kbp, and these averaged 92.2% sequence and 7.8% gaps with a total of 2.492 Gbp of
sequence. There were a total of 105,264 gaps among the 107,199 contigs that belong to the 1940
scaffolds spanning >100 kbp. The average scaffold size was 1.4 Mbp, the average contig size was
23.24 kbp, and the average gap size was 2.0 kbp where each distribution of sizes was
exponential. As such, averages tend to be underrepresentative of the majority of the data.
Figure 5 shows a histogram of the bases in scaffolds of various size ranges. Consider also that
more than 49% of all gaps were <500 bp long, more than 62% of all gaps were <1 kbp, and all gaps
are <100 kbp long. Similarly, more than 73% of the sequence is in contigs >30 kbp, more than 49%
is in contigs >100 kbp, and the largest contig was 1.99 Mbp long. Table 3 provides summary
statistics for the structure of this assembly with a direct comparison to the WGA assembly.

2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the hu- 
man genome via independent computational 
processes (WGA and CSA), we compared 
scaffolds from the two assemblies as another 
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- 
ing reads or bactig shreds) was obtained; this 
amounted to 2218 WGA scaffolds and 1717 
CSA scaffolds, for a total of 2.087 Gbp and 
2.474 Gbp. The sequence of each reference 
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For 
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated. 
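The two selection rules in this paragraph (which scaffold pairs to compare, and which matches to tabulate) reduce to simple predicates. The function names and the use of fragment-identifier sets are assumptions made for the sketch.

```python
def should_compare(frags_a, frags_b, min_shared=20, min_fraction=0.20):
    """True when two scaffolds share >= 20 fragments, or >= 20% of the
    fragments of the smaller scaffold.

    frags_a, frags_b: sets of fragment identifiers (Celera reads or
    bactig shreds) placed in each scaffold (assumed representation).
    """
    shared = len(frags_a & frags_b)
    smaller = min(len(frags_a), len(frags_b))
    return shared >= min_shared or (smaller > 0 and shared >= min_fraction * smaller)


def keep_match(length_bp, mismatches, min_len=200, max_mismatch=0.02):
    """True for matches of at least 200 bp with at most 2% mismatch."""
    return length_bp >= min_len and mismatches <= max_mismatch * length_bp
```

Because every reference scaffold holds at least 1000 fragments, the 20% rule mainly matters when the scaffold from the other assembly is small.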

From this tabulation, we estimated the amount of unique sequence in each assembly in two ways.
The first was to determine the number of bases of each assembly that were not covered by a
matching segment in the other assembly. Some 82.5 Mbp of the WGA (3.95%) was not covered by the
CSA, whereas 204.5 Mbp (8.26%) of the CSA was not covered by the WGA. This estimate did not
require any consistency of the assemblies or any uniqueness of the matching segments. Thus,
another analysis was conducted in which matches of less than 1 kbp between a pair of scaffolds
were excluded unless they were confirmed by other matches having a consistent order and
orientation. This gives some measure of consistent coverage: 1.982 Gbp (95.00%) of the WGA is
covered by the CSA, and 2.169 Gbp (87.69%) of the CSA is covered by the WGA by this more
stringent measure.

The comparison of WGA to CSA also permitted evaluation of scaffolds for structural
inconsistencies. We looked for instances in which a large section of a scaffold from one
assembly matched only one scaffold from the other assembly, but failed to match over the full
length of the overlap implied by the matching segments. An initial set of candidates was
identified automatically, and then each candidate was inspected by hand. From this process, we
identified 31 instances in which the assemblies appear to disagree in a nonlocal fashion. These
cases are being further evaluated to determine which assembly is in error and why.

In addition, we evaluated local inconsistencies of order or orientation. The following results
exclude cases in which one contig in one assembly corresponds to more than one overlapping
contig in the other assembly (as

The CSA assembly was a few percentage points better in terms of coverage and slightly more
consistent than the WGA, because it was in effect performing a few thousand shotgun assemblies
of megabase-sized problems, whereas the WGA is performing a shotgun assembly of a
gigabase-sized problem. When one considers the increase of two-and-a-half orders of magnitude
in problem size, the information loss between the two is remarkably small. Because CSA was
logistically easier to deliver and the better of the two results available at the time when
downstream analyses needed to be begun, all subsequent analysis was performed on this assembly.

2.6 Mapping scaffolds to the genome

The final step in assembling the genome was to order and orient the scaffolds on the
chromosomes. We first grouped scaffolds together on the basis of their order in the components
from CSA. These grouped scaffolds were reordered by examining residual mate-pairing data
between the scaffolds. We next mapped the scaffold groups onto the chromosome using physical
mapping data. This step depends on having reliable high-resolution map information such that
each scaffold will overlap multiple markers. There are two genome-wide types of map information
available: high-density STS maps and fingerprint maps of BAC clones developed at Washington
University (45). Among the genome-wide STS maps, GeneMap99 (GM99) has the most markers and
therefore was most useful for mapping scaffolds. The two different mapping approaches are
complementary to one another. The fingerprint maps should have bet-

In order to determine the effectiveness of the fingerprint maps and GM99 for mapping scaffolds,
we first examined the reliability of these maps by comparison with large scaffolds. Only 1% of
the STS markers on the 10 largest scaffolds (those >9 Mbp) were mapped on a different
chromosome on GM99. Two percent of the STS markers disagreed in position by more than five
framework bins. However, for the fingerprint maps, a 2% chromosome discrepancy was observed,
and on average 23.8% of BAC locations in the scaffold sequence disagreed with fingerprint map
placement by more than five BACs. When further examining the source of discrepancy, it was
found that most of the discrepancy came from 4 of the 10 scaffolds, indicating that there is
variation in the quality of either the map or the scaffolds. All four scaffolds were assembled
as well as the other six, as judged by clone coverage analysis, and showed the same low
discrepancy rate to GM99, and thus we concluded that the fingerprint map global order in these
cases was not reliable. Smaller scaffolds had a higher discordance rate with GM99 (4.21% of
STSs were discordant by more than five framework bins), but a lower discordance rate with the
fingerprint maps (11% of BACs disagreed with fingerprint maps by more than five BACs). This
observation agrees with the clone coverage analysis (46) that Celera scaffold construction was
better supported by long-range mate pairs in larger scaffolds than in small scaffolds.

We created two orderings of Celera scaffolds on the basis of the markers (BAC or
long as the order and orientation of the latter ter local order because they were built by com- STS) on these maps. Where the order of 
agrees with the positions they match in the parison of overlapping BAC clones. On the scaffolds agreed between GM99 and the 
former). Most of these small rearrangements other hand, GM99 should have a more reliable .WashU BAC map, we had a high degree of 
involved segments on the order of hundreds long-range order, because the framework mark- . ■ - . confidence that that order was correct; these 
ofbase pairs and rarely >1 kbp. We found a ers were derived from well-validated genetic scaffolds were termed "anchor scaffolds." 
total of 295 kbp (0.0 12%) in the CS A assem- maps. Both types of maps were used as a Only scaffolds with a low overall discrepancy 
blies that were locally inconsistent with the reference for human curation of the compo- rate with both maps were considered anchor 
WGA assemblies, whereas 2.108 Mbp nents that were the input to the regional assem- scaffolds. Scaffolds in GM99 bins were al- 
(0.11%) in the WGA assembly were incon- bly, but they did not determine the order of lowed to permute in their order to match 

WashU ordering, provided they did not vio- 
late their framework orders. Orientation of 
individual scaffolds was detennined by the 
presence of multiple mapped markers with . 
consistent order. Scaffolds with only one 
marker have insufficient information to as- 
sign orientation. We found 70.1% of the ge- 
nome in anchored scaffolds, more than 99% : 
of which are also oriented (Table 4). Because 
GM99 is of lower resolution than the WashU 
map, a number of scaffolds without STS 
matches could be ordered relative to the an- 
chored scaffolds because they included se- 
quence from the same or adjacent BACs on 
the WashU map. On the other hand, because 
of occasional WashU global ordering dis- 
crepancies, a number of scaffolds determined 
to be "unmappable" on the WashU map could 
be ordered relative to the anchored scaffolds 



sistent with the CSA assembly. 



sequences produced by the assembler. 
Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total
sequence is indicated.
with GM99. These scaffolds were termed
"ordered scaffolds." We found that 13.9% of
the assembly could be ordered by these additional
methods, and thus 84.0% of the genome
was ordered unambiguously.

Next, all scaffolds that could be placed,
but not ordered, between anchors were assigned
to the interval between the anchored
scaffolds and were deemed to be "bounded"
between them. For example, small scaffolds
having STS hits from the same GeneMap
bin or hitting the same BAC cannot be
ordered relative to each other, but can be
assigned a placement boundary relative to
other anchored or ordered scaffolds. The
remaining scaffolds either had no localization
information, conflicting information,
or could only be assigned to a generic
chromosome location. Using the above approaches,
~98% of the genome was anchored,
ordered, or bounded.

Finally, we assigned a location for each 
scaffold placed on the chromosome by 
spreading out the scaffolds per chromosome. 
We assumed that the remaining unmapped 
scaffolds, constituting 2% of the genome, 
were distributed evenly across the genome. 
By dividing the sum of unmapped scaffold
lengths by the total number of
mapped scaffolds, we arrived at an estimate
of the interscaffold gap of 1483 bp. This gap was
used to separate all the scaffolds on each
chromosome and to assign an offset in the
chromosome.
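
The gap estimate and offset assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `interscaffold_gap` and `assign_offsets`, the use of integer division, and the toy inputs are all hypothetical.

```python
def interscaffold_gap(unmapped_lengths, n_mapped_scaffolds):
    """Spread the total unmapped sequence evenly over the mapped scaffolds.

    Assumption: one gap is charged per mapped scaffold and the result is
    truncated to whole base pairs.
    """
    if n_mapped_scaffolds <= 0:
        raise ValueError("need at least one mapped scaffold")
    return sum(unmapped_lengths) // n_mapped_scaffolds


def assign_offsets(scaffold_lengths, gap):
    """Return the start offset of each scaffold along one chromosome.

    Scaffolds are laid out in their given order, separated by a fixed gap.
    """
    offsets = []
    position = 0
    for length in scaffold_lengths:
        offsets.append(position)
        position += length + gap
    return offsets
```

With a gap of 10 bp, for example, scaffolds of 100, 200, and 50 bp would start at offsets 0, 110, and 320.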

During the scaffold-mapping effort, we encountered
many problems that resulted in additional
quality assessment and validation analysis.
At least 978 (3% of 33,173) BACs were
believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BACs could not be assigned to 
unique positions within the CSA assembly and 
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in 
length and make up 18.6% of the total 
116,442 gaps in the CSA assembly.

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not us- 
ing this data was that doing so would have 
biased certain regions of the assembly by 
rearranging scaffolds to fit the transcript data
and made validation of both the assembly and
gene definition processes more difficult.
2.7 Assembly and validation analysis

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of coverage of the genome) and 
correctness (the structural accuracy of the 
order and orientation and the consensus se- 
quence of the assembly). 

Completeness. Completeness is defined as 
the percentage of the euchromatic sequence 
represented in the assembly. This cannot be 
known with absolute certainty until the euchromatin
sequence has been completed.
However, it is possible to estimate complete- 
ness on the basis of (i) the estimated sizes of 
intrascaffold gaps; (ii) coverage of the two 
published chromosomes, 21 and 22 (48, 49); 
and (iii) analysis of the percentage of an 
independent set of random sequences (STS 
markers) contained in the assembly. The 
whole-genome libraries contain heterochro- 
matic sequence and, although no attempt has 
been made to assemble it, there may be in- 
stances of unique sequence embedded in re- 
gions of heterochromatin as were observed in 
Drosophila (50, 51). 

The sequences of human chromosomes 21 
and 22 have been completed to high quality 
and published (48, 49). Although this se- 
quence served as input to the assembler, the 
finished sequence was shredded into a shotgun
data set so that the assembler had the
opportunity to assemble it differently from
the original sequence in the case of structural
polymorphisms or assembly errors in the
BAC data. In particular, the assembler must
be able to resolve repetitive elements at the 
scale of components (generally multimega- 
base in size), and so this comparison reveals 
the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The
flexibility to assemble "finished" sequence
differently on the basis of Celera data
resulted in an assembly with
more segments than the chromosome 21 and
22 sequences. We examined the reasons why
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at 
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were ana- 
lyzed by RepeatMasker and by searching 
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of 
common repetitive elements identified by Re- 
peatMasker; more than half of the remainder 
was lower copy number repeat elements. 
A more global way of assessing completeness
is to measure the content of an independent
set of sequence data in the assembly. We com- 
pared 48,938 STS markers from Genemap99 
(51) to the scaffolds. Because these markers 
were not used in the assembly processes, they 
provided a truly independent measure of completeness.
ePCR (53) and BLAST (54) were
used to locate STSs on the assembled genome.
We found 44,524 (91%) of the STSs in the
mapped genome. An additional 2648 markers
(5.4%) were found by searching the unassembled
data or "chaff." We identified 1283
STS markers (2.6%) not found in either Celera 
sequence or BAC data as of September 2000, 
raising the possibility that these markers may 
not be of human origin. If that were the case, 
the Celera assembled sequence would represent 
93.4% of the human genome and the unas- 
sembled data 5.5%, for a total of 98.9% cover- 
age. Similarly, we compared CSA against 
36,678 TNG radiation hybrid markers (55a) 
using the same method. We found that 32,371 
markers (88%) were located in the mapped 
CSA scaffolds, with 2055 markers (5.6%) 
found in the remainder. This gave a 94% cov- 
erage of the genome through another genome- 
wide survey. 
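
The marker-based completeness figures quoted above are simple ratios, and a short sketch makes the arithmetic explicit. The marker counts are taken from the text; the helper name `percent` and the constant names are hypothetical, and the rounding is only for display.

```python
def percent(found, total):
    """Share of markers found, expressed as a percentage."""
    return 100.0 * found / total


# Counts quoted in the text for the GeneMap99 STS comparison.
GM99_MARKERS = 48938
GM99_IN_MAPPED_GENOME = 44524
GM99_NOT_FOUND_ANYWHERE = 1283   # possibly not of human origin

# Counts quoted in the text for the TNG radiation hybrid comparison.
TNG_MARKERS = 36678
TNG_IN_MAPPED_SCAFFOLDS = 32371
TNG_IN_REMAINDER = 2055

# Share of all GM99 markers located in the mapped genome.
gm99_mapped = percent(GM99_IN_MAPPED_GENOME, GM99_MARKERS)

# Same share after setting aside markers found in no human data set.
gm99_mapped_adjusted = percent(
    GM99_IN_MAPPED_GENOME, GM99_MARKERS - GM99_NOT_FOUND_ANYWHERE
)

# Combined TNG coverage (mapped scaffolds plus remainder).
tng_total = percent(TNG_IN_MAPPED_SCAFFOLDS + TNG_IN_REMAINDER, TNG_MARKERS)
```

Excluding the unlocated markers from the denominator is what raises the mapped share from about 91% to about 93.4%.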

Table 4. Summary of scaffold mapping. Scaffolds
were mapped to the genome with different levels
of confidence (anchored scaffolds have the highest
confidence; unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered by
the WashU BAC map and GM99. Ordered scaffolds
were consistently ordered by at least one of
the following: the WashU BAC map, GM99, or
component tiling path. Bounded scaffolds had order
conflicts between at least two of the external
maps, but their placements were adjacent to a
neighboring anchored or ordered scaffold. Unmapped
scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given
below each category.

Correctness. Correctness is defined as the
structural and sequence accuracy of the assembly.
Because the source sequences for the
Celera data and the GenBank data are from
different individuals, we could not directly
compare the consensus sequence of the assembly
against other finished sequence for
determining sequencing accuracy at the nucleotide
level, although this has been done for
identifying polymorphisms as described in
Section 6. The accuracy of the consensus
sequence is at least 99.96% on the basis of a
statistical estimate derived from the quality
values of the underlying reads.

The structural consistency of the assembly 
can be measured by mate-pair analysis. In a 
correct assembly, every mated pair of se- 
quencing reads should be located on the con- 
sensus sequence with the correct separation 
and orientation between the pairs. A pair is 
termed "valid" when the reads are in the
correct orientation, and the distance between 
them is within the mean ± 3 standard devi- 
ations of the distribution of insert sizes of the 
library from which the pair was sampled. A 
pair is termed "misoriented" when the reads 
are not correctly oriented, and is termed "mis- 
separated" when the distance between the 
reads is not in the correct range but the reads 
are correctly oriented. The mean ± the stan- 
dard deviation of each library used by the 
assembler was determined as described 
above. To validate these, we examined all 
reads mapped to the finished sequence of 
chromosome 21 (48) and determined how 
many incorrect mate pairs there were as a 
result of laboratory tracking errors and chimerism
(two different segments of the genome
cloned into the same plasmid), and how
tight the distribution of insert sizes was for
those that were correct (Table 5). The standard
deviations for all Celera libraries were
quite small, less than 15% of the insert
length, with the exception of a few 50-kbp
libraries. The 2- and 10-kbp libraries contained
less than 2% invalid mate pairs, whereas
the 50-kbp libraries were somewhat higher
(~10%). Thus, although the mate-pair information
was not perfect, its accuracy was such
that measuring valid, misoriented, and mis- 
separated pairs with respect to a given assem- 
bly was deemed to be a reliable instrument 
for validation purposes, especially when sev- 
eral mate pairs confirm or deny an ordering. 
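
The three mate-pair categories defined above can be sketched in a few lines. This is an illustration only: the `Library` record, the function name `classify_mate_pair`, and the convention that a correctly oriented pair has its upstream read on the `+` strand and its downstream read on the `-` strand are assumptions, as is measuring separation between the two read start coordinates.

```python
from dataclasses import dataclass


@dataclass
class Library:
    mean: float  # mean insert size of the library, in bp
    sd: float    # standard deviation of the insert size, in bp


def classify_mate_pair(pos_a, strand_a, pos_b, strand_b, lib, k=3.0):
    """Label a mate pair as 'valid', 'misoriented', or 'misseparated'.

    Assumed convention: the upstream read lies on '+' and the downstream
    read on '-', so that the two reads face each other.
    """
    # Order the reads by coordinate so "left" is upstream on the consensus.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos_a, strand_a), (pos_b, strand_b)]
    )
    if (left_strand, right_strand) != ("+", "-"):
        return "misoriented"
    distance = right_pos - left_pos
    # Valid when the separation is within mean +/- k standard deviations.
    if abs(distance - lib.mean) <= k * lib.sd:
        return "valid"
    return "misseparated"
```

Orientation is checked first, so a pair is only called misseparated when its reads are correctly oriented, matching the definitions in the text.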

The clone coverage of the genome was
39×, meaning that any given base pair was,
on average, contained in 39 clones or, equivalently,
spanned by 39 mate-paired reads.
Areas of low clone coverage or areas with a
high proportion of invalid mate pairs would
indicate potential assembly problems. We
computed the coverage of each base in the
assembly by valid mate pairs (Table 6). In
summary, for scaffolds >30 kbp in length,
less than 1% of the Celera assembly was in
regions of less than 3× clone coverage. Thus,
more than 99% of the assembly, including
order and orientation, is strongly supported
by this measure alone.
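
Per-base clone coverage of the kind summarized above can be computed with a difference array. The sketch below is a hypothetical illustration, not the authors' pipeline: the function names and the representation of each valid mate pair as a half-open interval `(start, end)` on the scaffold are assumptions.

```python
def clone_coverage(length, clones):
    """Return the number of clones spanning each base of a scaffold.

    `clones` holds half-open intervals (start, end), one per valid mate
    pair, giving the stretch of scaffold spanned by that clone.
    """
    delta = [0] * (length + 1)
    for start, end in clones:
        # Clamp to the scaffold so out-of-range intervals cannot overflow.
        lo = min(max(start, 0), length)
        hi = min(max(end, 0), length)
        delta[lo] += 1
        delta[hi] -= 1
    coverage = []
    depth = 0
    for i in range(length):
        depth += delta[i]
        coverage.append(depth)
    return coverage


def fraction_below(coverage, threshold=3):
    """Fraction of bases whose clone coverage is below the threshold."""
    return sum(1 for depth in coverage if depth < threshold) / len(coverage)
```

Regions where `fraction_below` is high are the ones the text flags as potential assembly problems.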

We examined the locations and number of
all misoriented and misseparated mates. In
addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also
performed a study of the PFP assembly as of
5 September 2000 (30, 55b). In this latter
case, Celera mate pairs had to be mapped to 
the PFP assembly. To avoid mapping errors 
due to high-fidelity repeats, the only pairs 
mapped were those for which both reads 
matched at only one location with less than 
6% differences. A threshold was set such that 
sets of five or more simultaneously invalid 
mate pairs indicated a potential breakpoint, 
where the construction of the two assemblies 
differed. The graphic comparison of the CSA 
chromosome 21 assembly with the published 
sequence (Fig. 6A) serves as a validation of 
this methodology. Blue tick marks in the 
panels indicate breakpoints. There were a 
similar (small) number of breakpoints on 
both chromosome sequences. The exception 
was 12 sets of scaffolds in the Celera assem- 
bly (a total of 3% of the chromosome length 
in 212 single-contig scaffolds) that were 
mapped to the wrong positions because they 
were too small to be mapped reliably. Figures 
6 and 7 and Table 6 illustrate the mate-pair
differences and breakpoints between the two
assemblies. There was a higher percentage of
misoriented and misseparated mate pairs in
the large-insert libraries (50 kbp and BAC
ends) than in the small-insert libraries in both
assemblies (Table 6). The large-insert libraries
are more likely to identify discrepancies
simply because they span a larger segment of 
the genome. The graphic comparison be- 
tween the two assemblies for chromosome 8 
(Fig. 6, B and C) shows that there are many 



Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative orientation
or placement, they were considered invalid (number of invalid mate
pairs).
gene boundaries. During this process, multiple
hits to the same region were collapsed to a
coherent set of data by tracking the coverage of
a region. For example, if a group of bases was
represented by multiple overlapping ESTs, the
union of these regions matched by the set of
ESTs on the scaffold was marked as being
supported by EST evidence. This resulted in a
series of "gene bins," each of which was believed
to contain a single gene. One weakness of
this initial implementation of the algorithm was
in predicting gene boundaries in regions of tandemly
duplicated genes. Gene clusters frequently
resulted in homologous neighboring genes
being joined together, resulting in an annotation
that artificially concatenated these gene models.
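
Collapsing overlapping hits into the union of the regions they cover, as described above, is a standard interval merge. The sketch below is a minimal illustration under assumed conventions: hits are `(start, end)` pairs on one scaffold, and the name `merge_hits` is hypothetical. It does not reproduce how the pipeline went on to delimit gene bins.

```python
def merge_hits(hits):
    """Collapse overlapping (start, end) hits into their union.

    Hits that overlap or touch are merged into a single supported region.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]
```

A merge like this also shows why tandem duplicates are hard: hits from neighboring homologous genes can chain together into one long region.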

Next, known genes (those with exact match- 
es of a full-length cDNA sequence to the ge- 
nome) were identified, and the region corre- 
sponding to the cDNA was annotated as a 
predicted transcript. A subset of the curat- 
ed human gene set RefSeq from the Nation- 
al Center for Biotechnology Information 
(NCBI) was included as a data set searched in 
the computational pipeline. If a RefSeq tran- 
script matched the genome assembly for at least 
50% of its length at >92% identity, then the 
SIM4 (63) alignment of the RefSeq transcript to
the region of the genome under analysis was
promoted to the status of an Otto annotation. 
Because the genome sequence has gaps and 
sequence errors such as frameshifts, it was not 
always possible to predict a transcript that 
agrees precisely with the experimentally deter- 
mined cDNA sequence. A total of 6538 genes 
in our inventory were identified and transcripts 
predicted in this way. 

Regions that have a substantial amount of 
sequence similarity, but do not match known 
genes, were analyzed by that part of the Otto 
system that uses the sequence similarity in- 
formation to predict a transcript. Here, Otto 
Fig. 6. Comparison of the CSA and the PFP assembly.
(A) All of chromosome 21, (B) all of chromosome 8, 
and (C) a 1-Mb region of chromosome 8 representing 
a single Celera scaffold. To generate the figure, Celera 
fragment sequences were mapped onto each assem- 
bly. The PFP assembly is indicated in the upper third 
of each panel; the Celera assembly is indicated in the 
lower third. In the center of the panel, green lines 
show Celera sequences that are in the same order and 
orientation in both assemblies and form the longest 
consistently ordered run of sequences. Yellow lines 
indicate sequence blocks that are in the same orien- 
tation, but out of order. Red lines indicate sequence 
blocks that are not in the same orientation. For 
clarity, in the latter two cases, lines are only drawn 
between segments of matching sequence that are at 
least 50 kbp long. The top and bottom thirds of each 
panel show the extent of Celera mate-pair violations 
(red, misoriented; yellow, incorrect distance between 
the mates) for each assembly grouped by library size. 
(Mate pairs that are within the correct distance, as 
expected from the mean library insert size, are omit- 
ted from the figure for clarity.) Predicted breakpoints, 
corresponding to stacks of violated mate pairs of the 
same type, are shown as blue ticks on each assembly 
axis. Runs of more than 10,000 Ns are shown as cyan 
bars. Plots of all 24 chromosomes can be seen in Web 
fig. 3 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.
evaluates evidence generated by the computational
pipeline, corresponding to conservation
between mouse and human genomic
DNA, similarity to human transcripts (ESTs
and cDNAs), similarity to rodent transcripts
(ESTs and cDNAs), and similarity of the
translation of human genomic DNA to known
proteins, to predict potential genes in the human
genome. The sequence from the region
of genomic DNA contained in a gene bin was
extracted, and the subsequences supported by
any homology evidence were marked (plus 100
bases flanking these regions). The other bases
in the region, those not covered by any homology
evidence, were replaced by N's. This sequence
segment, with high confidence regions
represented by the consensus genomic sequence
and the remainder represented by N's,
was then evaluated by Genscan to see if a
consistent gene model could be generated. This
procedure simplified the gene-prediction task
by first establishing the boundary for the gene
(not a strength of most gene-finding algorithms),
and by eliminating regions with no
supporting evidence. If Genscan returned a
plausible gene model, it was further evaluated
before being promoted to an "Otto" annotation.
The final Genscan predictions were often quite
different from the prediction that Genscan returned
on the same region of native genomic
sequence. A weakness of using Genscan to
refine the gene model is the loss of valid, small
exons from the final annotation.
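
The masking step described above, in which bases without homology support are replaced by N's, can be sketched as follows. This is a hedged illustration: the function name `mask_unsupported`, the half-open `(start, end)` interval convention, and the toy flank used in the usage note are assumptions; only the 100-base default flank comes from the text.

```python
def mask_unsupported(sequence, evidence, flank=100):
    """Replace every base not covered by evidence (plus flank) with 'N'.

    `evidence` holds half-open intervals (start, end) on `sequence`.
    """
    keep = [False] * len(sequence)
    for start, end in evidence:
        lo = max(0, start - flank)
        hi = min(len(sequence), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(
        base if supported else "N" for base, supported in zip(sequence, keep)
    )
```

For instance, masking `"ACGTACGTAC"` with a single evidence interval `(4, 6)` and a flank of 1 keeps only positions 3 through 6 and yields `"NNNTACGNNN"`.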

The next step in defining gene structures 
based on sequence similarity was to compare 
each predicted transcript with the homology- 
based evidence that was used in previous steps 
to evaluate the depth of evidence for each exon 
in the prediction. Internal exons were consid- 
ered to be supported if they were covered by 
homology evidence to within ±10 bases of 
their edges. For first and last exons, the internal 
edge was required to be within 10 bases, but the
external edge was allowed greater latitude to
allow for 5' and 3' untranslated regions
(UTRs). To be retained, a prediction for a
multi-exon gene must have evidence such that
the total number of "hits," as defined above,
divided by the number of exons in the predic- 
tion must be >0.66 or must correspond to a 
RefSeq sequence. A single-exon gene must be 
covered by at least three supporting hits (±10 
bases on each side), and these must cover the 
complete predicted open reading frame. For 
a single-exon gene, we also required that 
the Genscan prediction include both a start 
and a stop codon. Gene models that did not 
meet these criteria were disregarded, and 



Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were calculated
by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are significant
(Tukey HSD; P < 0.001).

Method                  Sensitivity   Specificity
Otto (RefSeq only)*     0.939         0.973
Otto (homology)†        0.604         0.884
Genscan                 0.501         0.633

*Refers to those annotations produced by Otto using only
the Sim4-polished RefSeq alignment rather than an evidence-based
Genscan prediction. †Refers to those
annotations produced by supplying all available evidence
to Genscan.
those that passed were promoted to Otto
predictions. Homology-based Otto predictions
do not contain 3' and 5' untranslated
sequence. Although three de novo gene-finding
programs [GRAIL, Genscan, and FgenesH
(63)] were run as part of the computational
analysis, the results of these programs were not
directly used in making the Otto predictions.
Otto predicted 11,226 additional genes by
means of sequence similarity.
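
The retention rule for multi-exon predictions can be sketched as a simple ratio test. This is one possible reading of the criteria, not the Otto implementation: the function names are hypothetical, a "hit" is counted as an exon whose two edges are both matched by a single evidence interval to within ±10 bases, and the extra latitude the text allows for the outer edges of first and last exons is deliberately left out.

```python
def is_supported(exon, hits, tol=10):
    """True if some evidence interval matches both exon edges within tol."""
    start, end = exon
    return any(
        abs(hit_start - start) <= tol and abs(hit_end - end) <= tol
        for hit_start, hit_end in hits
    )


def retain_multi_exon(exons, hits, matches_refseq=False, min_ratio=0.66):
    """Keep a multi-exon prediction if enough of its exons are supported.

    A prediction that corresponds to a RefSeq sequence is always kept.
    """
    supported = sum(is_supported(exon, hits) for exon in exons)
    return matches_refseq or supported / len(exons) > min_ratio
```

Under this reading, a three-exon prediction needs at least two supported exons, because 2/3 exceeds 0.66 while 1/3 does not.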

3.2 Otto validation 

To validate the Otto homology-based process
and the method that Otto uses to define the
structures of known genes, we compared transcripts
predicted by Otto with their corresponding
(and presumably correct) transcript from a
set of 4512 RefSeq transcripts for which there
was a unique SIM4 alignment (Table 7). In
order to evaluate the relative performance of
Otto and Genscan, we made three comparisons.
The first involved a determination of the accuracy
of gene models predicted by Otto with
only homology data other than the corresponding
RefSeq sequence (Otto homology in Table
7). We measured the sensitivity (correctly predicted
bases divided by the total length of the
cDNA) and specificity (correctly predicted
bases divided by the sum of the correctly and
incorrectly predicted bases). Second, we examined
the sensitivity and specificity of the Otto
predictions that were made solely with the RefSeq
sequence, which is the process that Otto
uses to annotate known genes (Otto-RefSeq).
And third, we determined the accuracy of the
Genscan predictions corresponding to these
RefSeq sequences. As expected, the alignment
method (Otto-RefSeq) was the most accurate,
and Otto-homology performed better than Genscan
by both criteria. Thus, 6.1% of true RefSeq
nucleotides were not represented in the Otto-RefSeq
annotations and 2.7% of the nucleotides
in the Otto-RefSeq transcripts were not contained
in the original RefSeq transcripts. The
discrepancies could come from legitimate
differences between the Celera assembly
and the RefSeq transcript due to polymorphisms,
incomplete or incorrect data in the
Celera assembly, errors introduced by Sim4
during the alignment process, or the presence
of alternatively spliced forms in the
data set used for the comparisons.
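
The base-level sensitivity and specificity defined above can be sketched by treating each transcript as a set of base positions. The representation and the function name `base_level_accuracy` are assumptions made for illustration; the paper derives the counts from alignments rather than from position sets.

```python
def base_level_accuracy(predicted, reference):
    """Return (sensitivity, specificity) at the base level.

    sensitivity = correctly predicted bases / bases in the reference
    specificity = correctly predicted bases / bases in the prediction
    """
    predicted = set(predicted)
    reference = set(reference)
    correct = len(predicted & reference)
    return correct / len(reference), correct / len(predicted)
```

A prediction covering only part of the reference, with no extra bases, scores a specificity of 1.0 but a sensitivity below 1.0, which is the pattern of a conservative predictor.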

Because Otto uses an evidence-based ap- 
proach to reconstruct genes, the absence of 
experimental evidence for intervening exons 
may inadvertently result in a set of exons that
cannot be spliced together to give rise to a
transcript. In such cases, Otto may "split genes"
when in fact all the evidence should be com- 
bined into a single transcript. We also examined 
the tendency of these methods to incorrectly 
split gene predictions. These trends are shown 
in Fig. 8. Both RefSeq and homology-based 
predictions by Otto split known genes into few- 
er segments than Genscan alone. 



3.3 Gene number 

Recognizing that the Otto system is quite 
conservative, we used a different gene-pre- 
diction strategy in regions where the ho- 
mology evidence was less strong. Here the 
results of de novo gene predictions were 
used. For these genes, we insisted that a 
predicted transcript have at least two of the 
following types of evidence to be included 
in the gene set for further analysis: protein, 
human EST, rodent EST, or mouse genome 
fragment matches. This final class of pre- 
dicted genes is a subset of the predictions 
made by the three gene-finding programs 
that were used in the computational pipeline.
For these, there was not sufficient
sequence similarity information for Otto to
attempt to predict a gene structure. The
three de novo gene-finding programs resulted
in about 155,695 predictions, of
which ~76,410 were nonredundant (nonoverlapping
with one another). Of these,
57,935 did not overlap known genes or 
predictions made by Otto. Only 21,350 of 
the gene predictions that did not overlap 
Otto predictions were partially supported 
by at least one type of sequence similarity 
evidence, and 8619 were partially support- 
ed by two types of evidence (Table 8). 

The sum of this number (21,350) and the 
number of Otto annotations (17,764), 39,114,
is near the upper limit for the human gene
complement. As seen in Table 8, if the requirement
for other supporting evidence is
made more stringent, this number drops rapidly
so that demanding two types of evidence
reduces the total gene number to 26,383 and
demanding three types reduces it to ~23,000.
Requiring that a prediction be supported by
all four categories of evidence is too stringent
because it would eliminate genes that encode
novel proteins (members of currently unde- 
scribed protein families). No correction for 
pseudogenes has been made at this point in 
the analysis. 
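
The gene totals in this paragraph follow from adding the Otto annotations to the de novo predictions that survive each evidence threshold. The sketch below restates that arithmetic with counts taken from the text; the constant and function names are hypothetical.

```python
# Counts quoted in the text.
OTTO_KNOWN = 6538           # Otto annotations based on known genes
OTTO_HOMOLOGY = 11226       # Otto annotations based on sequence similarity
DE_NOVO_ONE_TYPE = 21350    # de novo predictions with >= 1 evidence type
DE_NOVO_TWO_TYPES = 8619    # de novo predictions with >= 2 evidence types


def gene_total(de_novo_supported):
    """Otto annotations plus the de novo predictions that pass the filter."""
    return OTTO_KNOWN + OTTO_HOMOLOGY + de_novo_supported


upper_estimate = gene_total(DE_NOVO_ONE_TYPE)     # one evidence type required
stricter_estimate = gene_total(DE_NOVO_TWO_TYPES)  # two evidence types required
```

The two Otto classes sum to the 17,764 Otto annotations cited, so the totals differ only in how many de novo predictions are admitted.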

In a further attempt to identify genes that 
were not found by the autoannotation process 
or any of the de novo gene finders, we ex- 
amined regions outside of gene predictions 
that were similar to the EST sequence, and 
where the EST matched the genomic se- 
quence across a splice junction. After correct- 
ing for potential 3' UTRs of predicted genes, 
about 2500 such regions remained. Addition 
of a requirement for at least one of the fol- 
lowing evidence types — homology to mouse 
genomic sequence fragments, rodent ESTs, 
or cDNAs — or similarity to a known protein 
reduced this number to 1010. Adding this to 
the numbers from the previous paragraph 
would give us estimates of about 40,000, 
27,000, and 24,000 potential genes in the 
human genome, depending on the stringency 
of evidence considered. Table 8 illustrates the 
number of genes and presents the degree of 
confidence based on the supporting evidence.
Transcripts encoded by a set of 26,383 genes
were assembled for further analysis. This set
includes the 6538 genes predicted by Otto on
the basis of matches to known genes, 11,226
transcripts predicted by Otto based on homology
evidence, and 8619 from the subset of
transcripts from de novo gene-prediction programs
that have two types of supporting evidence.
The 26,383 genes are illustrated along
chromosome diagrams in Fig. 1. These are a
very preliminary set of annotations and are
subject to all the limitations of an automated
process. Considerable refinement is still necessary
to improve the accuracy of these transcript
predictions. All the predictions and
descriptions of genes and the associated evi- 
dence that we present are the product of 
completely computational processes, not ex- 
pert curation. We have attempted to enumer- 
ate the genes in the human genome in such a 
way that we have different levels of confi- 
dence based on the amount of supporting 
evidence: known genes, genes with good pro- 
tein or EST homology evidence, and de novo 
gene predictions confirmed by modest ho- 
mology evidence. 






3.4 Features of human gene transcripts

We estimate the average span for a "typical"
gene in the human DNA sequence to
be about 27,894 bases. This is based on the
average span covered by RefSeq transcripts,
used because it represents our highest
confidence set.

The set of transcripts promoted to gene 
annotations varies in a number of ways. As 
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas 
those promoted from gene-prediction pro- 
grams average about 3.7 exons. The largest 
number of exons that we have identified in a 
transcript is 234 in the titin mRNA. Table 8 
compares the amounts of evidence that support
the Otto and other predicted transcripts.
For example, one can see that a typical Otto 
transcript has 6.99 of its 7.81 exons supported 
by protein homology evidence. As would be 
expected, the Otto transcripts generally have 
more support than do transcripts predicted by 
the de novo methods. 

4 Genome Structure 

Summary. This section describes several of 
the noncoding attributes of the assembled
genome sequence and their correlations with 
the predicted gene set. These include an anal- 
ysis of G+C content and gene density in the 
context of cytogenetic maps of the genome, 
an enumerative analysis of CpG islands, and 
a brief description of the genome-wide repet- 
itive elements. 



4.1 Cytogenetic maps

Perhaps the most obvious, and certainly the
most visible, element of the structure of
the genome is the banding pattern produced
by Giemsa stain. Chromosomal banding
studies have revealed that about 17% to
20% of the human chromosome complement
consists of C-bands, or constitutive
heterochromatin (64). Much of this
heterochromatin is highly polymorphic and
consists of different families of alpha
satellite DNAs with various higher order
repeat structures (65). Many chromosomes
have complex inter- and intrachromosomal
duplications present in pericentromeric
regions (66). About 5% of the sequence reads
were identified as alpha satellite sequences;
these were not included in the assembly.
Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by
supplying all available evidence to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq
transcript. The zero class for the Otto-homology predictions shown here indicates that the
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call
was made because of insufficient evidence.



Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate
the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions).

                                                                        No. of lines of evidence*
                             Total     Mouse     Rodent    Protein   Human     ≥1        ≥2        ≥3        ≥4
Otto
  Number of transcripts      17,969    17,065    14,881    15,477    16,374    17,968†   17,501    15,877    12,451
  Number of exons            141,218   111,174   89,569    108,431   118,869   140,710   127,955   99,574    59,804
De novo
  Number of transcripts      58,032    14,463    5,094     8,043     9,220     21,350    8,619     4,947     1,904
  Number of exons            319,935   48,594    19,344    26,264    40,104    79,148    31,130    17,508    6,520
No. of exons per transcript
  Otto                       7.84      5.77      6.01      6.99      7.24      7.81      7.19      6.00      4.28
  De novo                    5.53      3.17      3.80      3.27      4.36      3.7       3.56      3.42      3.16

*Four kinds of evidence (conservation in 3X mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known proteins) were
considered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of predicted transcript.
†This number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.
Examination of pericentromeric regions is
ongoing. 

The remaining ~80% of the genome, the
euchromatic component, is divisible into G-,
R-, and T-bands (67). These cytogenetic bands
have been presumed to differ in their
nucleotide composition and gene density,
although we have been unable to determine
precise band boundaries at the molecular
level. T-bands are the most G+C- and
gene-rich, and G-bands are G+C-poor (68).
Bernardi has also offered a description of
the euchromatin at the molecular level as
long stretches of DNA of differing base
composition, termed isochores (denoted L, H1,
H2, and H3), which are >300 kbp in length
(69). Bernardi defined the L (light)
isochores as G+C-poor (<43%), whereas the H
(heavy) isochores fall into three G+C-rich
classes representing 24, 8, and 5% of the
genome. Gene concentration has been claimed
to be very low in the L isochores and 20-fold
more enriched in the H2 and H3 isochores
(70). By examining contiguous 50-kbp windows
of G+C content across the assembly, we found
that regions of G+C content >48% (H3
isochores) averaged 273.9 kbp in length,
those with G+C content between 43 and 48%
(H1+H2 isochores) averaged 202.8 kbp in
length, and the average span of regions with
<43% (L isochores) was 1078.6 kbp. The
correlation between G+C content and gene
density was also examined in 50-kbp windows
along the assembled sequence (Table 9 and
Figs. 10 and 11). We found that the density
of genes was greater in regions of high G+C
than in regions of low G+C content, as
expected. However, the correlation between
G+C content and gene density was not as
skewed as previously predicted (69). A higher
proportion of genes were located in the
G+C-poor regions than had been expected.
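
The windowed G+C analysis described above can be illustrated with a minimal
sketch. It is not the code used for the analysis; the function names are
hypothetical, ambiguous bases are simply ignored, and the handling of windows
that fall exactly on the 43% or 48% boundary is an assumption, since the text
does not specify it.

```python
def gc_fraction(seq):
    """Fraction of G or C among unambiguous bases (A, C, G, T).

    Returns None when the sequence has no unambiguous bases (e.g. a gap of Ns).
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def classify_isochore(gc):
    """Map a G+C fraction to the classes used in the text.

    L: <43%, H1+H2: 43 to 48%, H3: >48%. Boundary handling is an assumption.
    """
    if gc < 0.43:
        return "L"
    if gc <= 0.48:
        return "H1+H2"
    return "H3"


def window_classes(seq, window=50_000):
    """Classify consecutive, non-overlapping windows of the given size."""
    results = []
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_fraction(seq[start:start + window])
        label = classify_isochore(gc) if gc is not None else None
        results.append((start, gc, label))
    return results
```

Runs of adjacent windows that share a label could then be merged to estimate
the average span of each class, which is the quantity reported in the
paragraph above.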

Chromosomes 17, 19, and 22, which have 
a disproportionate number of H3-containing 
bands, had the highest gene density (Table 
10). Conversely, the chromosomes that we
found to have the lowest gene density, X, 4,
18, 13, and Y, also have the fewest H3 bands.
Chromosome 15, which also has few H3 
bands, did not have a particularly low gene 
density in our analysis. In addition, chromo- 
some 8, which we found to have a low gene 
density, does not appear to be unusual in its 
H3 banding. 

How valid is Ohno's postulate (71) that
mammalian genomes consist of oases of genes
in otherwise essentially empty deserts? It
appears that the human genome does indeed
contain deserts, or large, gene-poor regions.
If we define a desert as a region >500 kbp
without a gene, then we see that 605 Mbp, or
about 20% of the genome, is in deserts. These
are not uniformly distributed over the
various chromosomes. Gene-rich chromosomes 17,
19, and 22 have only about 12% of their
collective 171 Mbp in deserts, whereas
gene-poor chromosomes 4, 13, 18, and X have
27.5% of their 492 Mbp in deserts (Table 11).
The apparent lack of predicted genes in these
regions does not necessarily imply that they
are devoid of biological function.
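
The desert definition above (a region >500 kbp without a gene) translates
directly into an interval scan. The following is a sketch only, assuming genes
are supplied as half-open (start, end) coordinate pairs on one chromosome; the
function names and the input layout are illustrative and do not come from the
analysis itself.

```python
def find_deserts(chrom_length, genes, min_size=500_000):
    """Return gene-free intervals longer than min_size base pairs.

    genes: iterable of (start, end) half-open intervals; overlaps are allowed.
    """
    deserts = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_length - cursor > min_size:
        deserts.append((cursor, chrom_length))
    return deserts


def desert_fraction(chrom_length, genes, min_size=500_000):
    """Fraction of the chromosome that lies in deserts."""
    total = sum(end - start
                for start, end in find_deserts(chrom_length, genes, min_size))
    return total / chrom_length
```

Summing the desert lengths over all chromosomes and dividing by the genome
size gives the kind of genome-wide fraction quoted in the text.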

4.2 Linkage map 

Linkage maps provide the basis for genetic
analysis and are widely used in the study of
the inheritance of traits and in the
positional cloning of genes. The distance
metric, centimorgans (cM), is based on the
recombination rate between homologous
chromosomes during meiosis. In general, the
rate of recombination in females is greater
than that in males, and this degree of map
expansion is not uniform across the genome
(72). One of the opportunities enabled by a
nearly complete genome sequence is to produce
the ultimate physical map, and to fully
analyze its correspondence with two other
maps that have been widely used in genome
and genetic analysis: the linkage map and the
cytogenetic map. This would close the loop
between the mapping and sequencing phases of
the genome project.

We mapped the location of the markers
that constitute the Genethon linkage map to
the genome. The rate of recombination,
expressed as cM per Mbp, was calculated for
3-Mbp windows as shown in Table 12. Higher
rates of recombination in the telomeric
region of the chromosomes have been
previously documented (73). From this mapping
result, there is a difference of 4.99 between
the lowest and highest rates, and the largest
difference between males and females is 4.4
(4.99 to 0.47 on chromosome 16). This
indicates that the variability in
recombination rates among regions of the
genome exceeds the differences in
recombination rates between males and
females. The human genome has recombination
hotspots, where recombination rates vary
fivefold or more over a space of 1 kbp, so
the picture one gets of the magnitude of
variability in recombination rate will
depend on the size of the window



Table 9. Characteristics of G+C in isochores.

                          Fraction of genome (%)     Fraction of genes (%)
Isochore    G+C (%)       Predicted*   Observed      Predicted*   Observed
H3          >48           5            9.5           37           24.8
H1/H2       43-48         25           21.2          32           26.6
L           <43           67           69.2          31           48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.



Fig. 9. Comparison of the number of exons per transcript between
the 17,968 Otto transcripts and 21,350 de novo transcript
predictions with at least one line of evidence that do not overlap
with an Otto prediction. Both sets have the highest number of
transcripts in the two-exon category, but the de novo gene
predictions are skewed much more toward smaller transcripts. In
the Otto set, 19.7% of the transcripts have one or two exons, and 5.7%
examined. Unfortunately, too few meiotic
crossovers have occurred in Centre d'Etude
du Polymorphisme Humain (CEPH) and other
reference families to provide a resolution any
finer than about 3 Mbp. The next challenge
will be to determine a sequence basis of
recombination at the chromosomal level. An
accurate predictor for variation in
recombination rates between any pair of
markers would be extremely useful in
designing markers to narrow a region of
linkage, such as in positional cloning
projects.
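
The cM-per-Mbp calculation in fixed physical windows can be sketched as
follows. This is an illustration under stated assumptions, not the procedure
used to build Table 12: markers are given as hypothetical (physical position in
bp, genetic position in cM) pairs, genetic position is linearly interpolated
between flanking markers, and windows are laid end to end from the first
marker.

```python
from bisect import bisect_right


def interpolate_cm(markers, pos):
    """Linearly interpolate genetic position (cM) at physical position pos.

    markers: list of (bp, cM) pairs sorted by bp, with distinct bp values.
    Positions outside the marker range are clamped to the end markers.
    """
    xs = [bp for bp, _ in markers]
    if pos <= xs[0]:
        return markers[0][1]
    if pos >= xs[-1]:
        return markers[-1][1]
    i = bisect_right(xs, pos)
    (x0, y0), (x1, y1) = markers[i - 1], markers[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def window_rates(markers, window_bp=3_000_000):
    """Recombination rate (cM per Mbp) for consecutive windows."""
    markers = sorted(markers)
    end = markers[-1][0]
    pos = markers[0][0]
    rates = []
    while pos + window_bp <= end:
        d_cm = interpolate_cm(markers, pos + window_bp) - interpolate_cm(markers, pos)
        rates.append((pos, d_cm / (window_bp / 1e6)))
        pos += window_bp
    return rates
```

The maximum, average, and minimum of the per-window rates on a chromosome
correspond to the three columns reported for each sex in Table 12.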
4.3 Correlation between CpG islands
and genes

CpG islands are stretches of unmethylated 
DNA with a higher frequency of CpG 
dinucleotides when compared with the entire 
genome (74). CpG islands are believed to 
preferentially occur at the transcriptional start 
of genes, and it has been observed that most 
housekeeping genes have CpG islands at the 
5' end of the transcript (75, 76). In addition, 
experimental evidence indicates that CpG is- 
land methylation is correlated with gene in- 
activation (77) and has been shown to be 
important during gene imprinting (78) and 
tissue-specific gene expression (79).

Experimental methods have been used
that resulted in an estimate of 30,000 to
45,000 CpG islands in the human genome
(74, 80) and an estimate of 499 CpG islands
on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer
(75) used a computational method to
identify CpG islands and defined them as
regions of DNA of >200 bp that have a G+C
content of >50% and a ratio of observed
versus expected frequency of CG
dinucleotide >0.6.

It is difficult to make a direct comparison
of experimental definitions of CpG islands
with computational definitions because
computational methods do not consider the
methylation state of cytosine and
experimental methods do not directly select
regions of high G+C content. However, we
can determine the correlation of CpG island
with gene starts, given a set of annotated
genomic transcripts and the whole genome
sequence. We have analyzed the publicly
available annotation of chromosome 22, as
well as using the entire human genome in
our assembly and the computationally
annotated genes. A variation of the CpG
island computation was compared with
Larsen et al. (76). The main differences are
that we use a sliding window of 200 bp,
consecutive windows are merged only if
they overlap, and we recompute the CpG
value upon merging, thus rejecting any
potential island if it scores less than the
threshold.
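
The island criteria and the sliding-window variant described above can be
sketched as below. This is a minimal illustration, not the implementation used
for the analysis. The observed/expected ratio is written in the commonly used
form (number of CG x length) / (number of C x number of G); the function
names, the 1-bp step, and the strict ">" comparisons are assumptions.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CG dinucleotide ratio)."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")
    gc = (c + g) / n if n else 0.0
    obs_exp = (cg * n) / (c * g) if c and g else 0.0
    return gc, obs_exp


def passes(seq, min_gc=0.5, min_ratio=0.6):
    """True when the sequence exceeds both thresholds."""
    gc, ratio = cpg_stats(seq)
    return gc > min_gc and ratio > min_ratio


def find_cpg_islands(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Slide a window, merge overlapping hits, and re-score merged regions."""
    hits = [(s, s + window)
            for s in range(0, len(seq) - window + 1, step)
            if passes(seq[s:s + window], min_gc, min_ratio)]
    merged = []
    for start, end in hits:
        if merged and start < merged[-1][1]:  # merge only if windows overlap
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    # Recompute the statistics on each merged region and reject failures.
    return [(s, e) for s, e in merged if passes(seq[s:e], min_gc, min_ratio)]
```

Raising min_ratio from 0.6 to 0.8 in this sketch corresponds to the stricter
of the two thresholds discussed in the text.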

To compute various CpG statistics, we
used two different thresholds of CG
dinucleotide likelihood ratio. Besides using
the original threshold of 0.6 (method 1), we
used a higher threshold of CG dinucleotide
likelihood ratio of 0.8 (method 2), which
results in the number of CpG islands on
chromosome 22 close to the number of annotated
genes on this chromosome. The main results are
summarized in Table 13. CpG islands computed
with method 1 predicted only 2.6% of the
CSA sequence as CpG, but 40% of the gene
starts (start codons) are contained inside a
CpG island. This is comparable to ratios
reported by others (82). The last two rows of
the table show the observed and expected
average distance, respectively, of the closest
CpG island from the first exon. The observed
average distances to the closest CpG island
are smaller than the corresponding expected
distances, confirming an association between
CpG islands and the first exon.

We also looked at the distribution of CpG
island nucleotides among various sequence
classes such as intergenic regions, introns,
exons, and first exons. We computed the
likelihood score for each sequence class as
the ratio of the observed fraction of CpG
island nucleotides in that sequence class
and the expected fraction of CpG island
nucleotides in that sequence class. Applying
method 1 to the CSA gave scores of 0.89 for
intergenic region, 1.2 for intron, 5.86 for
exon, and 13.2 for first exon. The same trend
was also found for chromosome 22 and after
the application of a higher threshold
(method 2) on both data sets. In sum,
genome-wide analysis has extended earlier
analysis and suggests a strong correlation
between CpG islands and first coding exons.
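
The likelihood score defined above is an observed-to-expected ratio, and a
small sketch makes the arithmetic explicit. The text does not state how the
expected fraction was obtained; here it is assumed to be the share of the total
sequence occupied by each class, so that a score of 1 means island nucleotides
occur in that class at the background rate. The function name and the numbers
in the usage note are invented for illustration.

```python
def likelihood_scores(class_lengths, island_bases_by_class):
    """Observed/expected ratio of CpG-island nucleotides per sequence class.

    class_lengths: total bases in each class (e.g. intergenic, intron, exon).
    island_bases_by_class: CpG-island bases that fall in each class.
    Expected fraction is ASSUMED to be the class's share of all bases.
    """
    total_length = sum(class_lengths.values())
    total_island = sum(island_bases_by_class.values())
    scores = {}
    for name, length in class_lengths.items():
        observed = island_bases_by_class.get(name, 0) / total_island
        expected = length / total_length
        scores[name] = observed / expected
    return scores
```

With made-up class lengths of 700, 250, and 50 bases and island counts of 50,
25, and 25, the exon class (5% of the sequence, 25% of island bases) would
score about 5, the same kind of enrichment the text reports for exons.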
4.4 Genome-wide repetitive elements

The proportion of the genome covered by 
various classes of repetitive DNA is present- 
ed in Table 14. We observed about 35% of 
the genome in these repeat classes, very sim- 
ilar to values reported previously (83). Repet- 
itive sequence may be underrepresented in 
the Celera assembly as a result of incomplete 
repeat resolution, as discussed above. About 
8% of the scaffold length is in gaps, and we 
expect that much of this is repetitive se- 
quence. Chromosome 19 has the highest re- 
peat density (57%), as well as the highest 
gene density (Table 10). Of interest, among 
the different classes of repeat elements, we 
observe a clear association of Alu elements 
and gene density, which was not observed 
between LINEs and gene density. 

5 Genome Evolution 

Summary. The dynamic nature of genome 
evolution can be captured at several levels. 
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and 
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive 
genes (pseudogenes). Genes involved in 
translational processes and nuclear regulation 
account for nearly 50% of all intronless para- 
logs and processed pseudogenes detected in 
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes. 
Fig. 11 (continued). Relation among gene density (orange), G+C content
(green), EST density (blue), and Alu density (pink) along the lengths of
each of the chromosomes. Gene density was calculated in 1-Mbp windows.
The percent of G+C nucleotides was calculated in 100-kbp
windows. The number of ESTs and Alu elements is shown per 100-kbp
window.



5.1 Retrotransposition in the human 
genome 

Retrotransposition of processed mRNA 
transcripts into the genome results in func- 
tional genes, called intronless paralogs, or 
inactivated genes (pseudogenes). A paralog
refers to a gene that appears in more than
one copy in a given organism as a result of
a duplication event. The existence of both
intron-containing and intronless forms of 
genes encoding functionally similar or 
identical proteins has been previously de- 
scribed (84, 85). Cataloging these evolu- 
tionary events on the genomic landscape is 
of value in understanding the functional 
consequences of such gene-duplication 



events in cellular biology. Identification of 
conserved intronless paralogs in the mouse 
or other mammalian genomes should pro- 
vide the basis for capturing the evolution- 
ary chronology of these transposition 
events and provide insights into gene loss 
and accretion in the mammalian radiation. 
A set of proteins corresponding to all 901
Otto-predicted, single-exon genes was
subjected to BLAST analysis against the
proteins encoded by the remaining multiexon
predicted transcripts. Using homology
criteria of [...]% sequence identity over 90%
of the length, we identified 298 instances of
single- to multi-exon correspondence. Of these
298 sequences, 97 were represented in the
GenBank data set of experimentally validated
full-length genes at the stringency specified
and were verified by manual inspection.
We believe that these 97 cases may represent
intronless paralogs (see Web table 1 on
Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1)
of known genes. Most of these are flanked by
direct repeat sequences, although the precise
nature of these repeats remains to be
determined. All of the cases for which we
have high confidence contain polyadenylated
[poly(A)] tails characteristic of
retrotransposition.

Recent publications describing the
phenomenon of functional intronless paralogs
speculate that retrotransposition may serve as
a mechanism used to escape X-chromosomal
inactivation (84, 86). We do not find a bias
toward X chromosome origination of these
retrotransposed genes; rather, the results
show a random chromosome distribution of
both the intron-containing and corresponding
intronless paralogs. We also have found
several cases of retrotransposition from a
single source chromosome to multiple target
chromosomes. Interesting examples include the
retrotransposition of a five-exon-containing
ribosomal protein L21 gene on chromosome
13 onto chromosomes 1, 3, 4, 7, 10, and 14,
respectively. The size of the source genes can
also show variability. The largest example is
the 31-exon diacylglycerol kinase zeta gene
on chromosome 11 that has an intronless
paralog on chromosome 13. Regardless of
route, retrotransposition with subsequent
gene changes in coding or noncoding regions
that lead to different functions or expression
patterns represents a key route to providing
an enhanced functional repertoire in mammals
(87).

Our preliminary set of retrotransposed
intronless paralogs contains a clear
overrepresentation of genes involved in
translational processes (40% ribosomal
proteins and 10% translation elongation
factors) and nuclear regulation (HMG
nonhistone proteins, 4%), as well as
metabolic and regulatory enzymes. EST matches
specific to a subset of intronless paralogs
suggest expression of these intronless
paralogs. Differences in the upstream
regulatory sequences between the source
genes and their intronless paralogs could
account for differences in tissue-specific
gene expression. Defining which, if any, of
these processed genes are functionally
expressed and translated will require further
elucidation and experimental validation.



5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is
very similar to a normal gene but that has
been altered slightly so that it is not
expressed. We developed a method for the
preliminary analysis of processed pseudogenes
in the human genome as a starting point in
elucidating the ongoing evolutionary forces

Table 11. Genome overview.

Size of the genome (including gaps)                                      2.91 Gbp
Size of the genome (excluding gaps)                                      2.66 Gbp
Longest contig                                                           1.99 Mbp
Longest scaffold                                                         14.4 Mbp
Percent of A+T in the genome                                             54
Percent of G+C in the genome                                             38
Percent of undetermined bases in the genome                              9
Most GC-rich 50 kb                                                       Chr. 2 (66%)
Least GC-rich 50 kb                                                      Chr. X (25%)
Percent of genome classified as repeats                                  35
Number of annotated genes                                                26,383
Percent of annotated genes with unknown function                         42
Number of genes (hypothetical and annotated)                             39,114
Percent of hypothetical and annotated genes with unknown function        59
Gene with the most exons                                                 Titin (234 exons)
Average gene size                                                        27 kbp
Most gene-rich chromosome                                                Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                              Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)             605 Mbp
Percent of base pairs spanned by genes                                   25.5 to 37.8*
Percent of base pairs spanned by exons                                   1.1 to 1.4*
Percent of base pairs spanned by introns                                 24.4 to 36.4*
Percent of base pairs in intergenic DNA                                  74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons             Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons              Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)       Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                    1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical + annotated gene set (39,114 genes), respectively.



Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated
in 3-Mb windows for each chromosome. NA, not applicable.

                   Male                  Sex-average              Female
Chrom.     Max.    Avg.    Min.     Max.    Avg.    Min.     Max.    Avg.    Min.
1          2.60    1.12    0.23     2.81    1.42    0.52     3.39    1.76    0.68
2          2.23    0.78    0.33     2.65    1.12    0.54     3.17    1.40    0.61
3          2.55    0.86    0.23     2.40    1.07    0.42     2.71    1.30    0.33
4          1.66    0.67    0.15     2.06    1.04    0.60     2.50    1.40    0.77
5          2.00    0.67    0.18     1.87    1.08    0.42     2.26    1.43    0.62
6          1.97    0.71    0.28     2.57    1.12    0.37     3.47    1.67    0.64
7          2.34    1.16    0.48     1.67    1.17    0.47     2.27    1.21    0.34
8          1.83    0.73    0.14     2.40    1.05    0.46     3.44    1.36    0.43
9          2.01    0.99    0.53     1.95    1.32    0.77     2.63    1.66    0.82
10         3.73    1.03    0.22     3.05    1.29    0.66     2.84    1.51    0.76
11         1.43    0.72    0.31     2.13    0.99    0.47     3.10    1.32    0.49
12         4.12    0.76    0.26     3.35    1.16    0.49     2.93    1.55    0.59
13         1.60    0.75    0.01     1.87    0.95    0.17     2.49    1.19    0.32
14         3.15    0.98    0.18     2.65    1.30    0.62     3.14    1.63    0.75
15         2.28    0.94    0.34     2.31    1.22    0.42     2.53    1.56    0.54
16         1.83    1.00    0.47     2.70    1.55    0.63     4.99    2.32    1.12
17         3.87    0.87    0.00     3.54    1.35    0.54     4.19    1.83    0.94
18         3.12    1.37    0.86     3.75    1.66    0.43     4.35    2.24    0.72
19         3.02    0.97    0.10     2.57    1.41    0.49     2.89    1.75    0.87
20         3.64    0.89    0.00     2.79    1.50    0.83     3.31    2.15    1.34
21         3.23    1.26    0.69     2.37    1.62    1.08     2.58    1.90    1.18
22         1.25    1.10    0.84     1.88    1.41    1.08     3.73    2.08    0.93
X          NA      NA      NA       NA      NA      NA       3.12    1.64    0.72
Y          NA      NA      NA       NA      NA      NA       NA      NA      NA
Genome     4.12    0.88    0.00     3.75    1.22    0.17     4.99    1.55    0.32
that account for gene inactivation. The
general structural characteristics of these
processed pseudogenes include the complete
lack of intervening sequences found in the
functional counterparts, a poly(A) tract at
the 3' end, and direct repeats flanking the
pseudogene sequence. Processed pseudogenes
occur as a result of retrotransposition,
whereas unprocessed pseudogenes arise from
segmental genome duplication.

We searched the complete set of Otto- 
predicted transcripts against the genomic se- 
quence by means of BLAST. Genomic re- 
gions corresponding to all Otto-predicted 
transcripts were excluded from this analysis. 
We identified 2909 regions matching with 
greater than 70% identity over at least 70% of 
the length of the transcripts that likely repre- 
sent processed pseudogenes. This number is 
probably an underestimate because specific 
methods to search for pseudogenes were not 
used. 
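
The identity and coverage thresholds quoted above (greater than 70% identity
over at least 70% of the transcript length) can be applied to alignment hits
with a simple filter. The sketch below is illustrative only: the Hit record
and its field names are hypothetical and do not correspond to any particular
BLAST output format.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical summary of one transcript-to-genome alignment."""
    transcript_id: str
    transcript_length: int   # bases in the query transcript
    aligned_length: int      # transcript bases covered by the alignment
    percent_identity: float  # identity over the aligned region, 0 to 100


def is_candidate_pseudogene(hit, min_identity=70.0, min_coverage=0.70):
    """Apply the two thresholds described in the text to a single hit."""
    coverage = hit.aligned_length / hit.transcript_length
    return hit.percent_identity > min_identity and coverage >= min_coverage


def candidate_regions(hits):
    """Keep only hits that satisfy both thresholds."""
    return [h for h in hits if is_candidate_pseudogene(h)]
```

In the analysis described, hits falling on the genomic loci of the Otto
transcripts themselves were excluded before this kind of filtering.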

We looked for correlations between
structural elements and the propensity for
retrotransposition in the human genome.
GC content and transcript length were
compared between the genes with processed
pseudogenes (1177 source genes) versus
the remainder of the predicted gene set.
Transcripts that give rise to processed
pseudogenes have shorter average transcript
length (1027 bp versus 1594 bp for the Otto
set) as compared with genes for which no
pseudogene was detected. The overall GC
content did not show any significant
difference, contrary to a recent report (88).
There is a clear trend in gene families that
are present as processed pseudogenes. These
include ribosomal proteins (67%), lamin
receptors (10%), translation elongation
factor alpha (5%), and HMG-non-histone
proteins (2%). The increased occurrence of
retrotransposition (both intronless paralogs
and processed pseudogenes) among genes
involved in translation and nuclear
regulation may reflect an increased
transcriptional activity of these genes.



5.3 Gene duplication in the human 
genome 

Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
human protein set into protein families (89). 



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the
whole genome (CS assembly). Method 1 uses a CG likelihood ratio of >0.6. Method 2 uses a CG likelihood ratio of >0.8.

                                                        Chromosome 22            Whole genome (CS assembly)
                                                        Method 1    Method 2     Method 1    Method 2
Number of CpG islands detected                          5,211       522          195,706     26,876
Average length of island (bp)                           390         535          395         497
Percent of sequence predicted as CpG                    5.9         0.8          2.6         0.4
Percent of first exons that overlap a CpG island        44          25           42          22
Percent of first exons with first position of exon
  contained inside a CpG island                         37          22           40          21
Average distance between first exon and closest
  CpG island (bp)                                       1,013       10,486       2,182       17,021
Expected distance between first exon and closest
  CpG island (bp)                                       3,262       32,567       7,164       55,811



Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

                                               Megabases in     Percent      Previously
                                               assembled        of           predicted
Repetitive elements                            sequences        assembly     (%) (83)
Alu                                            288              9.9          10.0
Mammalian interspersed repeat (MIR)            66               2.3          1.7
Medium reiteration (MER)                       50               1.7          1.6
Long terminal repeat (LTR)                     155              5.3          5.6
Long interspersed nucleotide element (LINE)    466              16.1         16.7
Total                                          1025             35.3         35.6



The complete clusters that result from the
Lek clustering provide one basis for
comparing the role of whole-genome or
chromosomal duplication in protein family
expansion as opposed to other means, such as
tandem duplication. Because each complete
cluster represents a closed and certain
island of homology, and because Lek is
capable of simultaneously clustering protein
complements of several organisms, the number
of proteins contributed by each organism to a
complete cluster can be predicted with
confidence depending on the quality of the
annotation of each genome. The variance of
each organism's contribution to each cluster
can then be calculated, allowing an
assessment of the relative importance of
large-scale duplication versus smaller-scale,
organism-specific expansion and contraction
of protein families, presumably as a result
of natural selection operating on individual
protein families within an organism. As can
be seen in Fig. 12, the large variance in the
relative numbers of human as compared with
D. melanogaster and Caenorhabditis elegans
proteins in complete clusters may be
explained by multiple events of relative
expansions in gene families in each of the
three animal genomes. Such expansions would
give rise to the distribution that shows a
peak at 1:1 in the ratio for human-worm or
human-fly clusters with the slope spread
covering both human and fly/worm
predominance, as we observed (Fig. 12).
Furthermore, there are nearly as many
clusters where worm and fly proteins
predominate despite the larger numbers of
proteins in the human. At face value, this
analysis suggests that natural selection
acting on individual protein families has
been a major force driving the expansion of
at least some elements of the human protein
set. However, in our analysis, the difference
between an ancient whole-genome duplication
followed by loss, versus piecemeal
duplication, cannot be easily distinguished.
In order to differentiate these scenarios,
more extended analyses were performed.



5.4 Large-scale duplications 

Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein 
family-based method that identified highly 
conserved blocks of duplication. We then 
describe our comprehensive method for identi- 
fying all interchromosomal block duplications. 
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

The first of the methods is based on the 
idea of searching for blocks of highly con- 
served homologous proteins that occur in 
more than one location on the genome. For 
this comparison, two genes were considered 
equivalent if their protein products were de- 



ft 
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termined to be in the same family and tl 
same complete Lek cluster (essentially 
paralogous genes) (89). Initially, each chro- 
mosome was represented as a string of genes 

•ordered by the start codons for predicted 
enes along the chromosome. We considered 
ie two strands as a single string, because 
local inversions are relatively common events 
relative to large-scale duplications. Each 
gene was indexed according to the protein 
family and Lek complete cluster (89). All 
pairs of indexed gene strings /were then 
aligned in both the forward and reverse di- 
rections with the Smith-Waterman algorithm 
(90). A match between two proteins of the 
same Lek complete cluster was given a score 
of 10 and a mismatch -10, with gap open 
and extend penalties of -4 and -1. With 
these parameters, 19 conserved interchromo- 
somal blocks of duplication were observed, 
all of which were also detected and expanded 
by the comprehensive method described be- 
low. The detection of only a relatively small 
number of block duplications was a conse- 
quence of using an intrinsically conservative 
method grounded in the conservative con- 
straints of the complete Lek clusters. 
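The scoring scheme described above can be illustrated with a minimal sketch of Smith-Waterman local alignment over strings of cluster identifiers. The function names and the toy gene strings are hypothetical, and the gap convention (the first gap position costs the open penalty, each further position costs the extend penalty) is an assumption, since the text gives only the penalty values.

```python
def local_align_score(a, b, match=10, mismatch=-10, gap_open=-4, gap_extend=-1):
    """Best local alignment score between two lists of cluster identifiers."""
    n, m = len(a), len(b)
    neg = float("-inf")
    # H: best score of an alignment ending at (i, j)
    # E, F: best score ending in a gap that skips an element of b / of a
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            step = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + step, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def best_block_score(a, b, **scoring):
    """Score both orientations, since blocks may be inverted."""
    return max(local_align_score(a, b, **scoring),
               local_align_score(a, b[::-1], **scoring))


# Toy gene strings: one chromosome carries an extra, unrelated gene (f9).
chrom_a = ["f1", "f2", "f3", "f4"]
chrom_b = ["x", "f1", "f2", "f9", "f3", "f4", "y"]
score = best_block_score(chrom_a, chrom_b)
```

In this toy case four matches (40) minus one single-position gap (4) give a score of 36, and the same score is found when one string is reversed.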

In the second, more comprehensive ap- 
proach, we aligned all chromosomes directly 
with one another using an algorithm based on 
the MUMmer system (91). This alignment 
method uses a suffix tree data structure and a 
linear-time algorithm to align long sequences 
very rapidly; for example, two chromosomes 
00 Mbp can be aligned in less than 20 
(on a Compaq Alpha computer) with 4 
gigabytes of memory. This procedure was 
used recently to identify numerous large- 
scale segmental duplications among the five 
chromosomes of A. thaliana (92); in that 
organism, the method revealed that 60% of 
the genome (66 Mbp) is covered by 24 very 
large duplicated segments. For Arabidopsis, a 
DNA-based alignment was sufficient to re- 
veal the segmental duplications between 
chromosomes; in the human genome, DNA 
alignments at the whole-chromosome level 
are insufficiently sensitive. Therefore, a mod- 
ified procedure was developed and applied, 
as follows. First, all 26,588 proteins 
(9,675,713 amino acids) were concat-
enated end-to-end in order as they occur
along each of the 24 chromosomes, irrespec- 
tive of strand location. The concatenated pro- 
tein set was then aligned against each chro-
mosome by the MUMmer algorithm. The
resulting matches were clustered to extract all 
sets of three or more protein matches that 
occur in close proximity on two different 
chromosomes (93); these represent the can- 
didate segmental duplications. A series of 
filters were developed and applied to remove 
likely false-positives from this set; for exam-
ple, small blocks that were spread across
many proteins were removed. To refine the
filtering methods, a shuffled protein set was
first created by taking the 26,588 proteins, 
randomizing their order, and then partitioning 
them into 24 shuffled chromosomes, each 
containing the same number of proteins as the 
true genome. This shuffled protein set has the 
identical composition to the real genome; in 
particular, every protein and every domain 
appears the same number of times. The com- 
plete algorithm was then applied to both the 
real and the shuffled data, with the results on 
the shuffled data being used to estimate the
false-positive rate. The algorithm after filter-
ing yielded 10,310 gene pairs in 1077 dupli-
cated blocks containing 3522 distinct genes;
tandemly duplicated expansions in many of 
the blocks explain the excess of gene pairs to 
distinct genes. In the shuffled data, by con- 
trast, only 370 gene pairs were found, giving 
a false-positive estimate of 3.6%. The most 
likely explanation for the 1077 block dupli- 
cations is ancient segmental duplications. In 
many cases, the order of the proteins has been 
shuffled, although proximity is preserved. 
Out of the 1077 blocks, 159 contain only
three genes, 137 contain four genes, and 781 
contain five or more genes. 
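The shuffled control and the false-positive estimate can be sketched as follows. This is an illustration only: `shuffle_genome` and the toy input are hypothetical names, and the block-detection step itself is not reproduced. Only the counts quoted in the text are used.

```python
import random


def shuffle_genome(chromosomes, seed=0):
    """Randomize protein order, then re-partition into chromosomes of the original sizes."""
    rng = random.Random(seed)
    pool = [p for proteins in chromosomes.values() for p in proteins]
    rng.shuffle(pool)
    shuffled, start = {}, 0
    for name, proteins in chromosomes.items():
        shuffled[name] = pool[start:start + len(proteins)]
        start += len(proteins)
    return shuffled


def false_positive_rate(pairs_real, pairs_shuffled):
    """Fraction of reported gene pairs expected to arise by chance."""
    return pairs_shuffled / pairs_real


# Toy genome (hypothetical protein names) to show that composition is preserved.
toy = {"chr1": ["p1", "p2", "p3"], "chr2": ["p4", "p5"]}
control = shuffle_genome(toy)

# Counts quoted in the text: 10,310 pairs in real data, 370 in shuffled data.
rate = false_positive_rate(10_310, 370)

# Block sizes quoted in the text should add up to the 1077 blocks.
block_sizes = {"3 genes": 159, "4 genes": 137, "5 or more genes": 781}
total_blocks = sum(block_sizes.values())
```

The ratio 370 / 10,310 rounds to the 3.6% quoted above, and the three block-size classes sum to 1077.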

To illustrate the extent of the detected 
duplications, Fig. 13 shows all 1077 block
duplications indexed to each chromosome in 
24 panels in which only duplications mapped 
to the indexed chromosome are displayed. 
The figure makes it clear that the duplications 
are ubiquitous in the genome. One feature 
that it displays is many relatively small chro- 
mosomal stretches, with one-to-many dupli- 
cation relationships that are graphically strik- 
ing. One such example captured by the anal- 
ysis is the well-documented olfactory recep- 
tor (OR) family, which is scattered in blocks 
throughout the genome and which has been 
analyzed for genome-deployment reconstruc-
tions at several evolutionary stages (94). The
figure also illustrates that some chromo- 
somes, such as chromosome 2, contain many 
more detected large-scale duplications than 
others. Indeed, one of the largest duplicated
segments is a large block of 33 proteins on 
chromosome 2, spread among eight smaller 
blocks in 2p, that aligns to a paralogous set on 
chromosome 14, with one rearrangement (see 
chromosomes 2 and 14 panels in Fig. 13). 
The proteins are not contiguous but span a
region containing 97 proteins on chromo-
some 2 and 332 proteins on chromosome 14.
The likelihood of observing this many dupli-
cated proteins by chance, even over a span of
this length, is 2.3 × 10^-68 (93). This dupli-
cated set spans 20 Mbp on chromosome 2 and
63 Mbp on chromosome 14, over 70% of the 
latter chromosome. Chromosome 2 also con- 
tains a block duplication that is nearly as 
large, which is shared by chromosome arm 2q 
and chromosome 12. This duplication incor- 
porates two of the four known Hox gene 
clusters, but considerably expands the extent 
of the duplications proximally and distally on 
the pair of chromosome arms. This breadth of 
duplication is also seen on the two chromo- 
somes carrying the other two Hox clusters. 

An additional large duplication, between 
chromosomes 18 and 20, serves as a good
example to illustrate some of the features
common to many of the other observed large
duplications (Fig. 13, inset). This duplication
contains 64 detected ordered intrachromo- 
somal pairs of homologous genes. After dis- 
counting a 40-Mb stretch of chromosome 18 
free of matches to chromosome 20, which is 
likely to represent a large insert (between the 
gene assignments "Krup rel" and "collagen 
rel" on chromosome 18 in Fig. 13), the full 
duplication segment covers 36 Mb on chro-
mosome 18 and 28 Mb on chromosome 20.
By this measure, the duplication segment
spans nearly half of each chromosome's net
length. The most likely scenario is that the
whole span of this region was duplicated as a
single very large block, followed by shuffling
owing to smaller scale rearrangements. As
such, at least four subsequent rearrangements
would need to be invoked to explain the
relative insertions and inversions seen in the
duplicated segment interval. The 64 protein
pairs in this alignment occur among 217 pro-
tein assignments on chromosome 18, and
among 322 protein assignments on chromo-
some 20, for a density of involved proteins of
20 to 30%. This is consistent with an ancient
large-scale duplication followed by subse-
quent gene loss on one or both chromosomes.
Loss of just one member of a gene pair
subsequent to the duplication would result in
a failure to score a gene pair in the block; less
than 50% gene loss on the chromosomes
would lead to the duplication density ob-
served here. As an independent verification
of the significance of the alignments detect-
ed, it can be seen that a substantial number of
the pairs of aligning proteins in this duplica-
tion, including some of those annotated (Fig.
13), are those populating small Lek complete
clusters (see above). This indicates that they
are members of very small families of para-
logs; their relative scarcity within the genome
validates the uniqueness and robust nature of
their alignments.

Two additional qualitative features were ob-
served among many of the large-scale duplica-
tions. First, several proteins with disease asso-
ciations, with OMIM (Online Mendelian Inher-
itance in Man) assignments, are members of
duplicated segments (see web table 2 on Sci-
ence Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). We have also
observed a few instances where paralogs on
both duplicated segments are associated with
similar disease conditions. Notable among
these genes are proteins involved in hemostasis
(coagulation factors) that are associated with
bleeding disorders, transcriptional regulators
like the homeobox proteins associated with de-
velopmental disorders, and potassium channels
associated with cardiovascular conduction ab-
normalities. For each of these disease genes,
closer study of the paralogous genes in the
duplicated segment may reveal new insights
into disease causation, with further investiga-
tion needed to determine whether they might be
involved in the same or similar genetic diseases.
Second, although there is a conserved number
of proteins and coding exons predicted for spe-
cific large duplicated spans within the chromo-
some 18 to 20 alignment, the genomic DNA of
chromosome 18 in these specific spans is in
some cases more than 10-fold longer than the
corresponding chromosome 20 DNA. This se-
lective accretion of noncoding DNA (or con-
versely, loss of noncoding DNA) on one of a
pair of duplicated chromosome regions was
observed in many compared regions. Hypothe-
ses to explain which mechanisms foster these
processes must be tested.

Evaluation of the alignment results gives
some perspective on dating of the duplications.
As noted above, large-scale ancient segmental
duplication in fact best explains many of the
blocks detected by this genome-wide analysis.
The regions of human chromosomes involved
in the large-scale duplications expanded upon
above (chromosomes 2 to 14, 2 to 12, and 18 to
20) are each syntenic to a distinct mouse chro-
mosomal region. The corresponding mouse
chromosomal regions are much more similar in
sequence conservation, and even in order, to
their human synteny partners than the human
duplication regions are to each other. Further,
the corresponding mouse chromosomal regions
each bear a significant proportion of genes or-
thologous to the human genes on which the
human duplication assignments were made. On
the basis of these factors, the corresponding
mouse chromosomal spans, at coarse resolu-
tion, appear to be products of the same large-
scale duplications observed in humans. Al-
though further detailed analysis must be carried
out once a more complete genome is assembled
for mouse, the underlying large duplications
appear to predate the two species' divergence.
This dates the duplications, at the latest, before 
divergence of the primate and rodent lineages. 
This date can be further refined upon examina- 
tion of the synteny between human chromo- 
somes and those of chicken, pufferfish (Fugu 
rubripes), or zebrafish (95). The only sub- 
stantial syntenic stretches mapped in these 
species corresponding to both pairs of human 
duplications are restricted to the Hox cluster 
regions. When the synteny of these regions 
(or others) to human chromosomes is extend- 
ed with further mapping, the ages of the 
nearly chromosome-length duplications seen 
in humans are likely to be dated to the root of 
vertebrate divergence. 

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone 
many deletions and subsequent rearrangements; 
these events make it difficult to distinguish 
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hy- 
potheses is more likely. Comparisons of ge- 
nomes of different vertebrates, and even cross- 
phyla genome comparisons, will allow for the 
deconvolution of duplications to eventually re-
veal the stagewise history of our genome and
with it a history of the emergence of many of
the key functions that distinguish us from other
living things.

6 A Genome-Wide Examination of
Sequence Variations

Summary. Computational methods were used 
to identify single-nucleotide polymorphisms
(SNPs) by comparison of the Celera sequence
to other SNP resources. The SNP rate be-
tween two chromosomes was ~1 per 1200 to
1500 bp. SNPs are distributed nonrandomly
throughout the genome. Only a very small
proportion of all SNPs (<1%) potentially
impact protein function based on the func-
tional analysis of SNPs that affect the pre-
dicted coding regions. This results in an es-
timate that only thousands, not millions, of
genetic variations may contribute to the struc-
tural diversity of human proteins.



Having a complete genome sequence enables
researchers to achieve a dramatic acceleration
in the rate of gene discovery, but only through
analysis of sequence variation in DNA can we
discover the genetic basis for variation in health
among human beings. Whole-genome shotgun 
sequencing is a particularly effective method 
for detecting sequence variation in tandem with 
whole-genome assembly. In addition, we com- 
pared the distribution and attributes of SNPs 
ascertained by three other methods: (i) align- 
ment of the Celera consensus sequence to the 
PFP assembly, (ii) overlap of high-quality reads 
of genomic sequence (referred to as "Kwok"; 
1,120,195 SNPs) (97), and (iii) reduced repre-
sentation shotgun sequencing (referred to as
"TSC"; 632,640 SNPs) (98). These data were
consistent in showing an overall nucleotide di-
versity of ~8 × 10^-4, marked heterogeneity
across the genome in SNP density, and an 
overwhelming preponderance of noncoding 
variation that produces no change in expressed 
proteins. 



6.1 SNPs found by aligning the Celera
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full 
use of sequence depth and quality at every site, 
and quantitatively control the rate of false-pos- 
itive and false-negative calls with an explicit 
sampling model (99). Comparison of consensus 
sequences in the absence of these details neces- 
sitated a more ad hoc approach (quality scores 
could not readily be obtained for the PFP as- 
sembly). First, all sequence differences between 
the two consensus sequences were identified; 
these were then filtered to reduce the contribu- 
tion of sequencing errors and misassembly. As 
a measure of the effectiveness of the filtering 
step, we monitored the ratio of transition and 
transversion substitutions, because a 2:1 ratio
has been well documented as typical in mam-
malian evolution (100) and in human SNPs
(101, 102). The filtering steps consisted of re-
moving variants where the quality score in the
Celera consensus was less than 30 and where
the density of variants was greater than 5 in 400
bp. These filters resulted in shifting the transi-
tion-to-transversion ratio from 1.57:1 to
59:1. When applied to 2.3 Gbp of alignments 
between the Celera and PFP consensus se-
quences, these filters resulted in identification
of 2,104,820 putative SNPs from a total of
2,778,474 substitution differences. Overlaps 
between this set of SNPs and those found by 
other methods are described below. 

6.2 Comparisons to public SNP 
databases 

Additional SNPs, including 2,536,021 from
dbSNP (www.ncbi.nlm.nih.gov/SNP) and
13,150 from HGMD (Human Gene Muta-
tion Database, from the University of
Wales, UK), were mapped on the Celera con-
sensus sequence by a sequence similarity
search with the program PowerBlast (103). The
two largest data sets in dbSNP are the Kwok
and TSC sets, with 47% and 25% of the dbSNP
records. Low-quality alignments with partial
coverage of the dbSNP sequence and align-
ments that had less than 98% sequence identity
between the Celera sequence and the dbSNP
flanking sequence were eliminated. dbSNP se-
quences mapping to multiple locations on the
Celera genome were discarded. A total of
2,336,935 dbSNP variants were mapped to
1,223,038 unique locations on the Celera se-
quence, implying considerable redundancy in
dbSNP. SNPs in the TSC set mapped to
585,811 unique genomic locations, and SNPs in
the Kwok set mapped to 438,032 unique loca-
tions. The combined unique SNP count used
in this analysis, including Celera-PFP, TSC,
and Kwok, is 2,737,668. Table 15 shows that a
substantial fraction of SNPs identified by one of
these methods was also found by another meth-
od. The very high overlap (36.2%) between the

Kwok and Celera-PFP SNPs may be due in part
to the use by Kwok of sequences that went into
the PFP assembly. The unusually low overlap
(16.4%) between the Kwok and TSC sets is due
to their being the smallest two sets. In addition,
24.5% of the Celera-PFP SNPs overlap with
SNPs derived from the Celera genome se-
quences (46). SNP validation in population
samples is an expensive and laborious process,
so confirmation on multiple data sets may pro-
vide an efficient initial validation "in silico" (by
computational analysis).

Table 15. Overlap of SNPs from genome-wide
SNP databases. Table entries are SNP counts for
each pair of data sets. Numbers in parentheses are
the fraction of overlap, calculated as the count of
overlapping SNPs divided by the number of SNPs
in the smaller of the two databases compared.
Total SNP counts for the databases are: Celera-
PFP, 2,104,820; TSC, 585,811; and Kwok, 438,032.
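The overlap fractions reported for Table 15 can be checked with a short sketch. The totals are those given in the Table 15 caption; the assignment of each overlap count to a pair of data sets is an inference from the percentages quoted in the text (36.2% for Kwok and Celera-PFP, 16.4% for Kwok and TSC), not something the table layout itself confirms here.

```python
# Total SNP counts per data set, as given in the Table 15 caption.
totals = {"Celera-PFP": 2_104_820, "TSC": 585_811, "Kwok": 438_032}

# Overlap counts; the pairing with data sets is inferred (see lead-in).
overlaps = {
    ("TSC", "Celera-PFP"): 188_694,
    ("Kwok", "Celera-PFP"): 158_532,
    ("Kwok", "TSC"): 72_024,
}


def overlap_fraction(set_a, set_b, shared):
    """Shared SNPs divided by the size of the smaller of the two data sets."""
    return shared / min(totals[set_a], totals[set_b])


fractions = {
    pair: round(overlap_fraction(pair[0], pair[1], shared), 3)
    for pair, shared in overlaps.items()
}
```

Under that pairing the computed fractions are 0.322, 0.362, and 0.164, matching the parenthesized values in the table.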

One means of assessing whether the 
three sets of SNPs provide the same picture
of human variation is to tally the frequen-
cies of the six possible base changes in
each set of SNPs (Table 16). Previous mea-
sures of nucleotide diversity were mostly
derived from small-scale analysis on can-
didate genes (101), and our analysis with
all three data sets validates the previous
observations at the whole-genome scale.
There is remarkable homogeneity between
the SNPs found in the Kwok set, the TSC
set, and in our whole-genome shotgun (46)
in this substitution pattern. Compared with
the rest of the data sets, Celera-PFP devi-
ates slightly from the 2:1 transition-to-
transversion ratio observed in the other
SNP sets. This result is not unexpected
because some fraction of the computation-
ally identified SNPs in the Celera-PFP
comparison may in fact be sequence errors.
A 2:1 transition:transversion ratio for the
bona fide SNPs would be obtained if one
assumed that 15% of the sequence differ-
ences in the Celera-PFP set were a result of
(presumably random) sequence errors.
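The reasoning about sequence errors can be made concrete with a small mixture calculation. The sketch assumes that random errors produce transitions and transversions in a 1:2 ratio (each base has one possible transition and two possible transversions); that assumption is ours and is not stated in the text. The function name is hypothetical.

```python
def observed_ts_tv(error_fraction, true_ratio=2.0, error_ratio=0.5):
    """Transition:transversion ratio of a mixture of true SNPs and random errors.

    true_ratio  -- ts:tv among bona fide SNPs (2:1 in the text)
    error_ratio -- ts:tv among random errors (1:2, an assumption of this sketch)
    """
    ts_true = true_ratio / (1 + true_ratio)      # transition share among true SNPs
    ts_error = error_ratio / (1 + error_ratio)   # transition share among errors
    ts = (1 - error_fraction) * ts_true + error_fraction * ts_error
    return ts / (1 - ts)


ratio_with_15_percent_errors = observed_ts_tv(0.15)
ratio_without_errors = observed_ts_tv(0.0)
```

With 15% random errors the mixture ratio is (2 - 0.15) / (1 + 0.15), about 1.61:1, which illustrates how a modest error fraction pulls an underlying 2:1 ratio downward.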

6.3 Estimation of nucleotide diversity 
from ascertained SNPs 

The number of SNPs identified varied 
widely across chromosomes. In order to 
normalize these values to the chromosome 
size and sequence coverage, we used π, the
standard statistic for nucleotide diversity
(104). Nucleotide diversity is a measure of
per-site heterozygosity, quantifying the
probability that a pair of chromosomes
drawn from the population will differ at a
nucleotide site. In order to calculate nucle-
otide diversity for each chromosome, we
need to know the number of nucleotide
sites that were surveyed for variation, and
in methods like reduced representation se-
quencing, we need to know the sequence
quality and the depth of coverage at each
site. These data are not readily available, so
we could not estimate nucleotide diversity
from the TSC effort. Estimation of nucleo-
tide diversity from high-quality sequence
overlaps should be possible, but again
more information is needed on the details
of all the alignments.
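As a simplified illustration of the statistic, nucleotide diversity can be computed as the average number of pairwise differences per surveyed site. The sketch below assumes error-free, fully aligned sequences of equal length, which sidesteps the coverage and quality issues discussed above; the function name and toy sequences are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average number of pairwise differences per site among aligned sequences."""
    n_sites = len(seqs[0])
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return diffs / (len(pairs) * n_sites)


# Three toy chromosomes: two of the three pairs differ at one of four sites.
pi_toy = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])

# Two chromosomes differing at 1 site in 1250 surveyed sites.
pi_pair = nucleotide_diversity(["A" * 1250, "A" * 1249 + "G"])
```

For two chromosomes, one difference per 1250 sites gives a diversity of 8 × 10^-4, which lies within the range of roughly 1 SNP per 1200 to 1500 bp quoted in the summary.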

Estimation of nucleotide diversity from a 
shotgun assembly entails calculating, for each
column of the multialignment, the probability
that two or more distinct alleles are present
and the probability of detecting a SNP if in
fact the alleles have different sequence (i.e.,
the probability of correct sequence calls). The
greater the depth of coverage and the higher
the sequence quality, the higher is the chance
of successfully detecting a SNP (105). Even
after correcting for variation in coverage, the
nucleotide diversity appeared to vary across
autosomes. The significance of this heteroge-
neity was tested by analysis of variance with
estimates of π for 100-kbp windows to esti-
mate variability within chromosomes (for the
Celera-PFP comparison, F = 20 7v p <r 
0.0001). ' r < 

Average diversity for the autosomes es-
timated from the Celera-PFP comparison
was 8.94 × 10^-4. Nucleotide diversity on
the X chromosome was 6.54 × 10^-4. The
X is expected to be less variable than au-
tosomes, because for every four copies of
autosomes in the population, there are only
three X chromosomes, and this smaller ef-
fective population size means that random
drift will more rapidly remove variation
from the X (106).
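The three-to-four argument can be compared with the reported values in a few lines. The sketch assumes that both diversity values share the 10^-4 exponent and that neutral diversity scales linearly with the number of chromosome copies in the population; both are simplifications made for illustration.

```python
# Reported diversity values (both taken as multiples of 10^-4; see lead-in).
pi_autosome = 8.94e-4
pi_x = 6.54e-4

# Three X chromosomes for every four autosome copies in the population.
expected_ratio = 3 / 4
observed_ratio = pi_x / pi_autosome
```

The observed ratio is about 0.73, close to the 0.75 expected from copy number alone.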

Having ascertained nucleotide variation 
genome-wide, it appears that previous esti- 
mates of nucleotide diversity in humans 
based on samples of genes were reasonably 
accurate (101, 102, 106, 107). Genome-wide,
our estimate of nucleotide diversity was
8.98 × 10^-4 for the Celera-PFP alignment,
and a published estimate averaged over 10

t^w-ToT hWDm genes was 

6.4 Variation in nucleotide diversity 
across the human genome 

Such an apparently high degree of variabil- 
ity among chromosomes in SNP density 
raises the question of whether there is het-
erogeneity at a finer scale within chromo-

Table 16. Summary of nucleotide changes in different SNP data sets.

Table 15 entries (overlapping SNP count, with fraction of overlap in parentheses):
TSC and Celera-PFP: 188,694 (0.322)
Kwok and Celera-PFP: 158,532 (0.362)
Kwok and TSC: 72,024 (0.164)

Fig. 13. Segmental duplica-
tions between chromo-
somes in the human ge-
nome. The 24 panels show
the 1077 duplicated blocks
of genes, containing 10,310
pairs of genes in total. Each
line represents a pair of ho- 
mologous genes belonging 
to a block; all blocks con- 
tain at least three genes 
on each of the chromo- 
somes where they appear. 
Each panel shows all the 
duplications between a 
single chromosome and 
other chromosomes with 
shared blocks. The chro- 
mosome at the center of 
each panel is shown as a 
thick red line for emphasis. 
Other chromosomes are 
displayed from top to bot-
tom within each panel, or-
dered by chromosome
number. The inset (bot-
tom, center right) shows a
close-up of one duplica-
tion between chromo-
somes 18 and 20, expand- 
ed to display the gene 
names of 12 of the 64
gene pairs shown.

somes, and whether this heterogeneity is
greater than expected by chance. If SNPs
occur by random and independent mutations,
then it would seem that there ought to be a
Poisson distribution of numbers of SNPs in
fragments of arbitrary constant size. The ob-
served dispersion in the distribution of SNPs 
in 100-kbp fragments was far greater than 
predicted from a Poisson distribution (Fig. 
14). However, this simplistic model ignores 
the different recombination rates and popula- 
tion histories that exist in different regions of 
the genome. Population genetics theory holds 
that we can account for this variation with a 
mathematical formulation called the neutral
coalescent (109). Applying well-tested algo-
rithms for simulating the neutral coalescent
with recombination (110), and using an ef-
fective population size of 10,000 and a per-
base recombination rate equal to the mutation
rate (111), we generated a distribution of num-
bers of SNPs by this model as well (112). The
observed distribution of SNPs has a much larg-
er variance than either the Poisson model or the
coalescent model, and the difference is highly
significant. This implies that there is significant
variability across the genome in SNP density,
an observation that begs an explanation.
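The comparison with a Poisson expectation can be illustrated with the variance-to-mean ratio (index of dispersion), which is close to 1 for Poisson-distributed counts. The simulation below is a toy sketch, not the coalescent analysis described above: window counts, SNP totals, and the range of regional rates are arbitrary assumptions chosen so that the mean is about 80 SNPs per window.

```python
import random
from collections import Counter
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    return pvariance(counts) / mean(counts)


def simulate_windows(n_windows, n_snps, weights=None, seed=0):
    """Drop SNPs into windows at random, optionally with unequal window rates."""
    rng = random.Random(seed)
    hits = Counter(rng.choices(range(n_windows), weights=weights, k=n_snps))
    return [hits.get(w, 0) for w in range(n_windows)]


N_WINDOWS, N_SNPS = 2000, 160_000  # about 80 SNPs per window on average

# Homogeneous genome: every window has the same chance of receiving a SNP.
uniform_counts = simulate_windows(N_WINDOWS, N_SNPS)

# Heterogeneous genome: window rates vary threefold (an arbitrary choice).
rate_rng = random.Random(1)
rates = [rate_rng.uniform(0.5, 1.5) for _ in range(N_WINDOWS)]
patchy_counts = simulate_windows(N_WINDOWS, N_SNPS, weights=rates)

uniform_dispersion = dispersion_index(uniform_counts)
patchy_dispersion = dispersion_index(patchy_counts)
```

The homogeneous simulation stays near a dispersion of 1, whereas regional rate variation inflates it well above 1, which is the qualitative pattern reported for the observed SNP counts.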

Several attributes of the DNA sequence 
may affect the local density of SNPs, in- 
cluding the rate at which DNA polymerase 
makes errors and the efficacy of mismatch 
repair. One key factor that is likely to be 
associated with SNP density is the G+C 
content, in part because methylated cy- 
tosines in CpG dinucleotides tend to under- 
go deamination to form thymine, account- 
ing for a nearly 10-fold increase in the 
mutation rate of CpGs over other dinucle- 



otides. We tallied the GC content and nu- 
cleotide diversities in 100-kbp windows 
across the entire genome and found that the 
correlation between them was positive (r = 
0.21) and highly significant (P < 0.0001), 
but G+C content accounted for only a 
small part of the variation. 
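The statement that G+C content explains only a small part of the variation follows from squaring the correlation coefficient. A minimal sketch, with a hand-written Pearson function and made-up toy data used only to exercise it:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Reported correlation between G+C content and nucleotide diversity.
r_reported = 0.21
variance_explained = r_reported ** 2

# Toy check of the helper on perfectly correlated data (not real measurements).
r_toy = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

With r = 0.21, the squared correlation is about 0.044, so G+C content accounts for roughly 4% of the variance in diversity among windows.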

6.5 SNPs by genomic class 

To test homogeneity of SNP densities
across functional classes, we partitioned
sites into intergenic (defined as >5 kbp
from any predicted transcription unit), 5'-
UTR, exonic (missense and silent), in-
tronic, and 3'-UTR for 10,239 known
genes, derived from the NCBI RefSeq da-
tabase and all human genes predicted from
the Celera Otto annotation. In coding re-
gions, SNPs were categorized as either si-
lent, for those that do not change amino
acid sequence, or missense, for those that
change the protein product. The ratio of
missense to silent coding SNPs in Celera-
PFP, TSC, and Kwok sets (1.12, 0.91, and
0.78, respectively) shows a markedly re-
duced frequency of missense variants com-
pared with the neutral expectation, consis-
tent with the elimination by natural selec-
tion of a fraction of the deleterious amino
acid changes (112). These ratios are com-
parable to the missense-to-silent ratios of
0.88 and 1.17 found by Cargill et al. (101)
and by Halushka et al. (102). Similar re-
sults were observed in SNPs derived from
Celera shotgun sequences (46).

It is striking how small is the fraction of
SNPs that lead to potentially dysfunctional
alterations in proteins. In the 10,239 Ref-
Seq genes, missense SNPs were only about
0.12, 0.14, and 0.17% of the total SNP
counts in Celera-PFP, TSC, and Kwok
SNPs, respectively. Nonconservative pro-
tein changes constitute an even smaller frac-
tion of missense SNPs (47, 41, and 40% in
Celera-PFP, Kwok, and TSC). Intergenic re-
gions have been virtually unstudied (113), and
we note that 75% of the SNPs we identified
were intergenic (Table 17). The SNP rate was
highest in introns and lowest in exons. The SNP
rate was lower in intergenic regions than in
introns, providing one of the first discriminators
between these two classes of DNA. These SNP
rates were confirmed in the Celera SNPs, which
also exhibited a lower rate in exons than in
introns, and in extragenic regions than in in-
trons (46). Many of these intergenic SNPs will
provide valuable information in the form of
markers for linkage and association studies, and
some fraction is likely to have a regulatory
function as well.

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution.
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely
accounted for by a coalescent model of regional history. Axis label: Number of SNPs / 100 kb.

7 An Overview of the Predicted 
Protein-Coding Genes in the Human 
Genome 

Summary. This section provides an initial 
computational analysis of the predicted
protein set with the aim of cataloging
prominent differences and similarities
when the human genome is compared with
other fully sequenced eukaryotic genomes.
Over 40% of the predicted protein set in
humans cannot be ascribed a molecular
function by methods that assign proteins to
known families. A protein domain-based
analysis provides a detailed catalog of the
prominent differences in the human ge-
nome when compared with the fly and
worm genomes. Prominent among these are
domain expansions in proteins involved in
developmental regulation and in cellular 
processes such as neuronal function, hemo- 
stasis, acquired immune response, and cy- 
toskeletal complexity. The final enumera- 
tion of protein families and details of pro- 
tein structure will rely on additional exper- 
imental work and comprehensive manual 
curation. 

A preliminary analysis of the predicted hu- 
man protein-coding genes was conducted. 
Two methods were used to analyze and clas- 
sify the molecular functions of 26,588 pre- 
dicted proteins that represent 26,383 gene 
predictions with at least two lines of evidence 
as described above. The first method was 
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114, 115) and Celera's Pan- 
ther Classification (CPC) (Fig. 15) (116). 
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117). 

The results presented here are prelimi-
nary and are subject to several limitations.
Both the gene predictions and functional
assignments have been made by using com-
putational tools, although the statistical
models in Panther, Pfam, and SMART have
been built, annotated, and reviewed by ex-
pert biologists. In the set of computationally
predicted genes, we expect both false-positive 
predictions (some of these may in fact be inac- 
tive pseudogenes) and false-negative predic- 
tions (some human genes will not be computa- 
tionally predicted). We also expect errors in 
delimiting the boundaries of exons and genes. 
Similarly, in the automatic functional assign- 
ments, we also expect both false-positive and 
false-negative predictions. The functional as-
signment protocol focuses on protein families
that tend to be found across several organisms, 
or on families of known human genes. There- 
fore, we do not assign a function to many genes 
that are not in large families, even if the func- 
tion is known. Unless otherwise specified, all 
enumeration of the genes in any given family or 
functional category was taken from the set of 
26,588 predicted proteins, which were assigned 
functions by using statistical score cutoffs de- 
fined for models in Panther, Pfam, and 
SMART. 

For this initial examination of the pre- 
dicted human protein set, three broad ques- 
tions were asked: (i) What are the likely 
molecular functions of the predicted gene 
products, and how are these proteins cate-
gorized with current classification meth-
ods? (ii) What are the core functions that
appear to be common across the animals?
(iii) How does the human protein comple-
ment differ from that of other sequenced 
eukaryotes? 



7.1 Molecular functions of predicted
human proteins

Figure 15 shows an overview of the puta- 
tive molecular functions of the predicted 
26,588 human proteins that have at least 
two lines of supporting evidence. About
41% (12,809) of the gene products could
not be classified from this initial analysis 
and are termed proteins with unknown 
functions. Because our automatic classifi- 
cation methods treat only relatively large 
protein families, there are a number of 
"unclassified" sequences that do, in fact, 
have a known or predicted function. For the 
60% of the protein set that have automatic 
functional predictions, the specific protein 
functions have been placed into broad 
classes. We focus here on molecular func- 
tion (rather than higher order cellular pro- 
cesses) in order to classify as many proteins 
as possible. These functional predictions 
are based on similarity to sequences of 
known function. 

In our analysis of the 12,731 additional low- 
confidence predicted genes (those with only 
one piece of supporting evidence), only 636 
(5%) of these additional putative genes were 
assigned molecular functions by the automated 
methods. One-third of these 636 predicted 
genes represented endogenous retroviral proteins,
further suggesting that the majority of
these unknown-function genes are not real
genes. Given that most of these additional
12,095 genes appear to be unique among the
genomes sequenced to date, many may simply
represent false-positive gene predictions.
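
The counts in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative and not taken from the source.

```python
# Counts quoted in the paragraph above.
low_confidence_genes = 12_731    # predicted genes with one piece of supporting evidence
with_assigned_function = 636     # of those, assigned a molecular function automatically

# Genes left without a functional assignment.
without_function = low_confidence_genes - with_assigned_function

# Share of the low-confidence set that received an assignment.
fraction_assigned = with_assigned_function / low_confidence_genes

print(without_function)            # 12095
print(f"{fraction_assigned:.1%}")  # 5.0%
```

The difference reproduces the 12,095 unknown-function genes mentioned in the text, and the ratio rounds to the stated 5%.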

The most common molecular functions are 
the transcription factors and those involved in 
nucleic acid metabolism (nucleic acid enzyme).
Other functions that are highly represented in 
the human genome are the receptors, kinases 
and hydrolases. Not surprisingly, most of the 
hydrolases are proteases. There are also many 
proteins that are members of proto-oncogene 
families, as well as families of "select regula- 
tory molecules": (i) proteins involved in specif- 
ic steps of signal transduction such as hetero- 
trimeric GTP-binding proteins (G proteins) and 
cell cycle regulators, and (ii) proteins that mod- 
ulate the activity of kinases, G proteins, and 
phosphatases. 

Table 17. Distribution of SNPs in classes of genomic regions.

Genomic region class    Size of region examined (Mb)    Celera-PFP SNP density (SNP/Mb)
Intergenic              2185                            707
Gene (intron + exon)    646                             917
Intron                  615                             921
First intron            164                             808
Exon                    31                              529
First exon              10                              592
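
As a rough illustration of how the two numeric columns of Table 17 relate, the sketch below multiplies the size of each region class by its SNP density to get an implied SNP count. This product is not reported in the table itself; it is an assumption-laden back-calculation from rounded values, and the function name is hypothetical.

```python
# (region class, size examined in Mb, Celera-PFP SNP density in SNP/Mb), from Table 17.
TABLE_17 = [
    ("Intergenic", 2185, 707),
    ("Gene (intron + exon)", 646, 917),
    ("Intron", 615, 921),
    ("First intron", 164, 808),
    ("Exon", 31, 529),
    ("First exon", 10, 592),
]


def implied_snp_count(size_mb, density_per_mb):
    """Approximate number of SNPs implied by a region size and a density."""
    return size_mb * density_per_mb


for region, size_mb, density in TABLE_17:
    print(f"{region:22s} {implied_snp_count(size_mb, density):>10,d}")
```

Because both inputs are rounded in the table, the resulting counts should be read as order-of-magnitude estimates only.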



[Fig. 15 slice labels: category (number, percentage)]
cell adhesion (577, 1.9%)
miscellaneous (1318, 4.3%)
viral protein (100, 0.3%)
transfer/carrier protein (203, 0.7%)
transcription factor (1850, 6.0%)
nucleic acid enzyme (2308, 7.5%)
signaling molecule (376, 1.2%)
receptor (1543, 5.0%)
kinase (868, 2.8%)
select regulatory molecule (988, 3.2%)
transferase (610, 2.0%)
synthase and synthetase (313, 1.0%)
oxidoreductase (656, 2.1%)
lyase (117, 0.4%)
ligase (56, 0.2%)
isomerase (163, 0.5%)
hydrolase (1227, 4.0%)
chaperone (159, 0.5%)
cytoskeletal structural protein (876, 2.8%)
extracellular matrix (437, 1.4%)
immunoglobulin (264, 0.9%)
ion channel (406, 1.3%)
motor (376, 1.2%)
structural protein of muscle (296, 1.0%)
protooncogene (902, 2.9%)
select calcium binding protein (34, 0.1%)
intracellular transporter (350, 1.1%)
transporter (533, 1.7%)
molecular function unknown (12809, 41.7%)




Fig. 15. Distribution 
of the molecular 
functions of 26,383 
human genes. Each 
slice lists the num- 
bers and percentages 
(in parentheses) of 
human gene functions 
assigned to a given 
category of molecular 
function. The outer cir- 
cle shows the assign- 
ment to molecular 
function categories in 
the Gene Ontology
(GO) (779), and the 
inner circle shows 
the assignment to 
Celera's Panther mo- 
lecular function cate- 
gories (776). 



7.2 Evolutionary conservation of core
processes

Because of the various "model organism" 
genome-sequencing projects that have al- 
ready been completed, reasonable compara- 
tive information is available for beginning the 
analysis of the evolution of the human ge- 
nome. The genomes of S. cerevisiae ("bak- 
ers' yeast") (118) and two diverse invertebrates,
C. elegans (a nematode worm) (119)
and D. melanogaster (fly) (26), as well as the 
first plant genome, A. thaliana, recently com- 
pleted (92), provide a diverse background for 
genome comparisons. 

We enumerated the "strict orthologs" con- 
served between human and fly, and between 
human and worm (Fig. 16) to address the 
question, What are the core functions that
appear to be common across the animals?
The concept of orthology is important be- 
cause if two genes are orthologs, they can be 
traced by descent to the common ancestor of 
the two organisms (an "evolutionarily conserved
protein set"), and therefore are likely
to perform similar conserved functions in the 
different organisms. It is critical in this anal- 
ysis to separate orthologs (a gene that appears 
in two organisms by descent from a common 
ancestor) from paralogs (a gene that appears 
in more than one copy in a given organism by 
a duplication event) because paralogs may 
subsequently diverge in function. Following 
the yeast-worm ortholog comparison in
(120), we identified two different cases for
each pairwise comparison (human-fly and 
human-worm). The first case was a pair of 
genes, one from each organism, for which 
there was no other close homolog in either 
organism. These are straightforwardly identi- 
fied as orthologous, because there are no 
additional members of the families that complicate
separating orthologs from paralogs.
The second case is a family of genes with 
more than one member in either or both of the 
organisms being compared. Chervitz et al.
(120) dealt with this case by analyzing a
phylogenetic tree that described the relation- 
ships between all of the sequences in both 
organisms, and then looked for pairs of genes 
that were nearest neighbors in the tree. If the 
nearest-neighbor pairs were from different 
organisms, those genes were presumed to be 
orthologs. We note that these nearest neighbors
can often be confidently identified from
pairwise sequence comparison without hav- 
ing to examine a phylogenetic tree (see leg- 
end to Fig. 16). If the nearest neighbors are
not from different organisms, there has been 
a paralogous expansion in one or both organ- 
isms after the speciation event (and/or a gene 
loss by one organism). When this one-to-one 
correspondence is lost, defining an ortholog 
becomes ambiguous. For our initial compu- 
tational overview of the predicted human pro- 
tein set, we could not answer this question for 
every predicted protein. Therefore, we consider
only "strict orthologs," i.e., the proteins
with unambiguous one-to-one relationships
(Fig. 16). By these criteria, there are 2758
strict human-fly orthologs and 2031 strict
human-worm orthologs (1523 in common between
these sets). We define the evolutionarily
conserved set as those 1523 human proteins
that have strict orthologs in both
D. melanogaster and C. elegans.

The distribution of the functions of the
conserved protein set is shown in Fig. 16.
Comparison with Fig. 15 shows that, not
surprisingly, the set of conserved proteins is
not distributed among molecular functions in
the same way as the whole human protein set.
Compared with the whole human set (Fig.
15), there are several categories that are overrepresented
in the conserved set by a factor of
~2 or more. The first category is nucleic acid
enzymes, primarily the transcriptional machinery
(notably DNA/RNA methyltransferases,
DNA/RNA polymerases, helicases,
DNA ligases, DNA- and RNA-processing
factors, nucleases, and ribosomal proteins).
The basic transcriptional and translational
machinery is well known to have been conserved
over evolution, from bacteria through
to the most complex eukaryotes. Many ribonucleoproteins
involved in RNA splicing also
appear to be conserved among the animals.
Other enzyme types are also overrepresented
(transferases, oxidoreductases, ligases,
lyases, and isomerases). Many of these en-



Fig. 16. Functions of putative 
orthologs across vertebrate 
and invertebrate genomes. 
Each slice lists the number and 
percentages (in parentheses) 
of "strict orthologs" between 
the human, fly, and worm ge- 
nomes involved in a given category
of molecular function.
"Strict orthologs" are defined
here as bi-directional BLAST
best hits (780) such that each
orthologous pair (i) has a
BLASTP P-value of <10^-10
(720), and (ii) has a more sig- 
nificant BLASTP score than 
any paralogs in either organ- 
ism, i.e., there has likely been 
no duplication subsequent to 
speciation that might make 
the orthology ambiguous. This 
measure is quite strict and is a 
lower bound on the number of 
orthologs. By these criteria, 
there are 2758 strict human- 
fly orthologs, and 2031 hu- 
man-worm orthologs (1523 in 
common between these sets). 
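
The bi-directional best hit idea in this legend can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes hits are given as (query, subject, P-value) tuples, applies the 10^-10 cutoff from criterion (i), and treats a tie for the top hit as ambiguous. Criterion (ii), the comparison against paralogs within each organism, is not implemented here. All function names and the toy identifiers are hypothetical.

```python
from collections import defaultdict

P_CUTOFF = 1e-10  # P-value threshold quoted in the legend


def best_hits(hits):
    """Map each query to its single most significant subject, dropping ties."""
    by_query = defaultdict(list)
    for query, subject, p_value in hits:
        if p_value < P_CUTOFF:
            by_query[query].append((p_value, subject))
    best = {}
    for query, scored in by_query.items():
        scored.sort()
        # A tie for first place means the best hit is ambiguous, so skip it.
        if len(scored) == 1 or scored[0][0] < scored[1][0]:
            best[query] = scored[0][1]
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data with made-up identifiers.
human_vs_fly = [("hA", "fA", 1e-50), ("hA", "fB", 1e-12),
                ("hB", "fB", 1e-30), ("hC", "fC", 1e-5)]
fly_vs_human = [("fA", "hA", 1e-48), ("fB", "hB", 1e-29),
                ("fB", "hA", 1e-11), ("fC", "hC", 1e-5)]

pairs = reciprocal_best_hits(human_vs_fly, fly_vs_human)
print(pairs)  # [('hA', 'fA'), ('hB', 'fB')]
```

In the toy data, the hC/fC pair is excluded because its P-value does not pass the cutoff, which mirrors why the legend calls the measure a lower bound on the number of orthologs.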



[Fig. 16 slice labels: category (number, percentage)]
cytoskeletal structural protein (20, 1.2%)
chaperone (16, 0.9%)
cell adhesion (11, 0.6%)
miscellaneous (72, 4.2%)
viral protein (4, 0.2%)
transfer/carrier protein (11, 0.6%)
transcription factor (81, 4.7%)
nucleic acid enzyme (221, 12.9%)
extracellular matrix (12, 0.7%)
ion channel (7, 0.4%)
motor (13, 0.8%)
structural protein of muscle (8, 0.5%)
protooncogene (23, 1.3%)
intracellular transporter (51, 3.0%)
transporter (44, 2.6%)
receptor (23, 1.3%)
kinase (69, 4.0%)
select regulatory molecule (88, 5.1%)
transferase (70, 4.1%)
synthase and synthetase (64, 3.7%)
oxidoreductase (64, 3.7%)
lyase (12, 0.7%)
ligase (9, 0.5%)
hydrolase (80, 4.7%)
isomerase (21, 1.2%)
molecular function unknown (613, 35.8%)

zymes are involved in intermediary metabolism.
The only exception is the hydrolase
category, which is not significantly overrepresented
in the shared protein set. Proteases
form the largest part of this category, and
several large protease families have expanded
in each of these three organisms after their
divergence. The category of select regulatory
molecules is also overrepresented in the conserved
set. The major conserved families are
small guanosine triphosphatases (GTPases)
(especially the Ras-related superfamily, including
ADP ribosylation factor) and cell
cycle regulators (particularly the cullin family,
cyclin C family, and several cell division
protein kinases). The last two significantly
overrepresented categories are protein transport
and trafficking, and chaperones. The
most conserved groups in these categories are
proteins involved in coated vesicle-mediated
transport, and chaperones involved in protein
folding and heat-shock response [particularly
the DNAJ family, and heat-shock protein
60 (HSP60), HSP70, and HSP90 families].
These observations provide only a conservative
estimate of the protein families in the
context of specific cellular processes that
were likely derived from the last common
ancestor of the human, fly, and worm. As
stated before, this analysis does not provide a
complete estimate of conservation across the
three animal genomes, as paralogous duplication
makes the determination of true orthologs
difficult within the members of conserved
protein families.
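
One simple way to look at overrepresentation is the ratio of a category's share in the conserved set (Fig. 16) to its share in the whole human set (Fig. 15). The sketch below applies that ratio to a handful of categories, using the percentages printed on the figure slices. The source does not state how the authors judged significance, so this ratio is only an assumed, rough indicator, and the function name is hypothetical.

```python
# Percentages as printed on the slices of Fig. 15 (whole set) and Fig. 16 (conserved set).
WHOLE_SET_PCT = {
    "nucleic acid enzyme": 7.5,
    "intracellular transporter": 1.1,
    "chaperone": 0.5,
    "select regulatory molecule": 3.2,
    "transferase": 2.0,
    "hydrolase": 4.0,
}
CONSERVED_PCT = {
    "nucleic acid enzyme": 12.9,
    "intracellular transporter": 3.0,
    "chaperone": 0.9,
    "select regulatory molecule": 5.1,
    "transferase": 4.1,
    "hydrolase": 4.7,
}


def fold_enrichment(conserved_pct, whole_pct):
    """Ratio of a category's share in the conserved set to its share in the whole set."""
    return conserved_pct / whole_pct


ratios = {
    category: fold_enrichment(CONSERVED_PCT[category], WHOLE_SET_PCT[category])
    for category in WHOLE_SET_PCT
}

# List categories from most to least enriched.
for category, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{category:28s} {ratio:.2f}")
```

Among the categories listed here, hydrolase has the ratio closest to 1, which is in line with the text's remark that it is the exception. Because the inputs are rounded percentages, the other ratios are approximate and need not match a factor of exactly 2.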

7.3 Differences between the human
genome and other sequenced 
eukaryotic genomes 

To explore the molecular building blocks of 
the vertebrate taxon, we have compared the 
human genome with the other sequenced 
eukaryotic genomes at three levels: molec- 
ular functions, protein families, and protein 
domains. 

Molecular differences can be correlated 
with phenotypic differences to begin to reveal 
the developmental and cellular processes that 
are unique to the vertebrates. Tables 18 and
19 display a comparison among all sequenced
eukaryotic genomes, over selected protein/domain
families (defined by sequence similarity,
e.g., the serine-threonine protein kinases)
and superfamilies (defined by shared
molecular function, which may include several
sequence-related families, e.g., the cytokines).
In these tables we have focused on
(super) families that are either very large or
that differ significantly in humans compared
with the other sequenced eukaryote genomes.
We have found that the most prominent human
expansions are in proteins involved in (i)
acquired immune functions; (ii) neural development,
structure, and functions; (iii) intercellular
and intracellular signaling pathways
in development and homeostasis; (iv) hemostasis;
and (v) apoptosis.

Acquired immunity. One of the most
striking differences between the human genome
and the Drosophila or C. elegans genome
is the appearance of genes involved in
acquired immunity (Tables 18 and 19). This
is expected, because the acquired immune
response is a defense system that only occurs
in vertebrates. We observe 22 class I and 22
class II major histocompatibility complex
(MHC) antigen genes and 114 other immunoglobulin
genes in the human genome. In
addition, there are 59 genes in the cognate
immunoglobulin receptor family. At the domain
level, this is exemplified by an expansion
and recruitment of the ancient immunoglobulin
fold to constitute molecules such as
MHC, and of the integrin fold to form several
of the cell adhesion molecules that mediate
interactions between immune effector cells
and the extracellular matrix. Vertebrate-specific
proteins include the paracrine immune
regulators family of secreted 4-alpha helical
bundle proteins, namely the cytokines and
chemokines. Some of the cytoplasmic signal
transduction components associated with cytokine
receptor signal transduction are also
features that are poorly represented in the fly
and worm. These include protein domains
found in the signal transducer and activator of
transcription (STATs), the suppressors of cytokine
signaling (SOCS), and protein inhibitors
of activated STATs (PIAS). In contrast,
many of the animal-specific protein domains
that play a role in innate immune response,
such as the Toll receptors, do not appear to be
significantly expanded in the human genome.

Neural development, structure, and 
function. In the human genome, as compared 
with the worm and fly genomes, there is a 
marked increase in the number of members 
of protein families that are involved in
neural development. Examples include neu- 
rotrophic factors such as ependymin, nerve 
growth factor, and signaling molecules 
such as semaphorins, as well as the number
of proteins involved directly in neural 
structure and function such as myelin pro- 
teins, voltage-gated ion channels, and syn- 
aptic proteins such as synaptotagmin. 
These observations correlate well with the 
known phenotypic differences between the 
nervous systems of these taxa, notably (i) 
the increase in the number and connectivity 
of neurons; (ii) the increase in number of 
distinct neural cell types (as many as a 
thousand or more in human compared with 
a few hundred in fly and worm) (121); (iii) 
the increased length of individual axons; 
and (iv) the significant increase in glial cell 
number, especially the appearance of my- 
elinating glial cells, which are electrically 
inert supporting cells differentiated from 
the same stem cells as neurons. A number 



of prominent protein expansions are involved
in the processes of neural development.
Of the extracellular domains that mediate
cell adhesion, the connexin domain-containing
proteins (122) exist only in humans.
These proteins, which are not present
in the Drosophila or C. elegans genomes,
appear to provide the constitutive subunits
of intercellular channels and the structural
basis for electrical coupling. Pathway finding
by axons and neuronal network formation
is mediated through a subset of ephrins
and their cognate receptor tyrosine kinases
that act as positional labels to establish
topographical projections (123). The probable
biological role for the semaphorins (22
in human compared with 6 in the fly and 2
in the worm) and their receptors (neuropilins
and plexins) is that of axonal guidance
molecules (124). Signaling molecules such
as neurotrophic factors and some cytokines
have been shown to regulate neuronal cell
survival, proliferation, and axon guidance
(125). Notch receptors and ligands play
important roles in glial cell fate determination
and gliogenesis (126).

Other human expanded gene families play 
key roles directly in neural structure and 
function. One example is synaptotagmin (ex- 
panded more than twofold in humans relative 
to the invertebrates), originally found to reg- 
ulate synaptic transmission by serving as a 
Ca2+ sensor (or receptor) during synaptic
vesicle fusion and release (127). Of interest is
the increased co-occurrence in humans of
PDZ and the SH3 domains in neuronal-specific
adaptor molecules; examples include
proteins that likely modulate channel activity
at synaptic junctions (128). We also noted
expansions in several ion-channel families
(Table 19), including the EAG subfamily
(related to cyclic nucleotide gated channels),
the voltage-gated calcium/sodium channel
family, the inward-rectifier potassium channel
family, and the voltage-gated potassium
channel, alpha subunit family. Voltage-gated
sodium and potassium channels are involved 
in the generation of action potentials in neu- 
rons. Together with voltage-gated calcium 
channels, they also play a key role in cou- 
pling action potentials to neurotransmitter re- 
lease, in the development of neurites, and in 
short-term memory. The recent observation 
of a calcium-regulated association between 
sodium channels and synaptotagmin may 
have consequences for the establishment and 
regulation of neuronal excitability (129). 

Myelin basic protein and myelin-associat- 
ed glycoprotein are major classes of protein 
components in both the central and peripheral 
nervous system of vertebrates. Myelin P0 is a 
major component of peripheral myelin, and 
myelin proteolipid and myelin oligodendrocyte
glycoprotein are found in the central
nervous system. Mutations in any of these 

Table 18. Domain-based comparative analysis of proteins in H. sapiens (H),
D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A). The
predicted protein set of each of the above eukaryotic organisms was analyzed
with Pfam version 5.5 using E value cutoffs of 0.001. The number of proteins
containing the specified Pfam domains as well as the total number of domains
(in parentheses) are shown in each column. Domains were categorized into
cellular processes for presentation. Some domains (i.e., SH2) are listed in
more than one cellular process. Results of the Pfam analysis may differ from
results obtained based on human curation of protein families, owing to the
limitations of large-scale automatic classifications. Representative examples
of domains with reduced counts owing to the stringent E value cutoff used for
this analysis are marked with a double asterisk (**). Examples include short
divergent and predominantly alpha-helical domains, and certain classes of
cysteine-rich zinc finger proteins.
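
The cell format described in this legend, a protein count optionally followed by a domain count in parentheses, can be parsed with a short helper. The sketch below is illustrative only. It assumes that a cell without parentheses means the domain count equals the protein count, which the legend does not state explicitly, and `parse_cell` is a hypothetical name.

```python
import re

# A cell is a protein count, optionally followed by a domain count in parentheses.
CELL = re.compile(r"^\s*(\d+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_cell(text):
    """Return (proteins, domains) for a Table 18 cell such as '100(550)' or '7'."""
    match = CELL.match(text)
    if match is None:
        raise ValueError(f"unrecognized cell: {text!r}")
    proteins = int(match.group(1))
    # Assumption: with no parenthesized value, each protein carries one domain.
    domains = int(match.group(2)) if match.group(2) else proteins
    return proteins, domains


proteins, domains = parse_cell("100(550)")  # human cadherin entry in Table 18
print(proteins, domains, domains / proteins)
```

For the human cadherin entry this gives 100 proteins and 550 domains, or 5.5 domains per protein on average. Cells damaged by scanning (stray punctuation or letters) raise a `ValueError` rather than being guessed at.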



Accession 
number 



Domain name 



Domain description 



H 



W 



PF02039 
PF00212 
PF00028 
PF00214 
PF01110 
PF01093 
PF00029 
PF00976 
PF00473 
PF00007 
PF00778 
PF00322 
PF00812 
PF01404 
PF00167 
PF01534 
PF00236 
PF01153 
PF01271 
PF02058 
PF00049 
PF00219 
PF02024 
PF00193 
PF00243 
PF02158 
PF00184 
PF02070 
PF00066 
PF00865 
PF00159 
PF01279 
PF00123 
PF00341 
PF01403 
PF01033 
PF00103 
PF02208 
PF02404 
PF01034 
PF00020 
PF00019 
PF01099 
PF01160
PF00110 

PF01821 

PF00386 

PF00200 

PF00754 

PF01410 

PF00039 

PF00040 

PF00051 

PF01823 
PF00354 
PF00277 
PF00084 
PF02210 
PF01108 
PF00868 
PF00927 



Adrenomedullin 
ANP 
Cadherin 
Calc_CGRP_IAPP
CNTF
Clusterin
Connexin 
ACTH_domain 
CRF 

Cys_knot 
DIX 

Endothelin 
Ephrin 
EPH_lbd
FGF
Frizzled 
Hormone6 
Glypican
Granin
Guanylin 
insulin 
IGFBP
Leptin
Xlink
NGF

Neuregulin 
Hormone5 
NMU 
Notch 

Osteopontin 
Hormone3 
Parathyroid 
Hormone2 
PDGF
Sema 

Somatomedin_B
Hormone 
Sorb 
SCF 

Syndecan 
TNFR_c6 
TGF-beta
Uteroglobin 
Opiods_neuropep 
Wnt 

ANATO 
C1q

Disintegrin 
F5_F8_type_C 
COLFI 
Fn1 
Fn2 
Kringle 
MACPF 
Pentaxin 
SAA_proteins 
Sushi 
TSPN 
Tissue_fac
Transglutamin_N
Transglutamin_C 



Developmental and homeostatic 

Adrenomedullin 
Atrial natriuretic peptide 
Cadherin domain 
Calcitonin/CGRP/IAPP family 
Ciliary neurotrophic factor
Clusterin 
Connexin 

Corticotropin ACTH domain 

Corticotropin-releasing factor family 

Cystine-knot domain 

Dix domain 

Endothelin family 

Ephrin 

Ephrin receptor ligand binding domain 
Fibroblast growth factor 
Frizzled/Smoothened family membrane region 
Glycoprotein hormones 
Glypican 

Granin (chromogranin or secretogranin)

Guanylin precursor 

Insulin/IGF/Relaxin family 

Insulin-like growth factor binding proteins 

Leptin 

LINK (hyaluron binding) 
Nerve growth factor family 
Neuregulin family 
Neurohypophysial hormones 
Neuromedin U 
Notch (DSL) domain 
Osteopontin 

Pancreatic hormone peptides 
Parathyroid hormone family 
Peptide hormone 

Platelet-derived growth factor (PDGF) 
Sema domain 
Somatomedin B domain 
Somatotropin 

Sorbin homologous domain 

Stem cell factor 

Syndecan domain 

TNFR/NGFR cysteine-rich region 

Transforming growth factor beta-like domain

Uteroglobin family 

Vertebrate endogenous opioids neuropeptide 
Wnt family of developmental signaling proteins 

Hemostasis 

Anaphylotoxin-like domain 

C1q domain 

Disintegrin 

F5/8 type C domain 

Fibrillar collagen C-terminal domain 

Fibronectin type I domain

Fibronectin type II domain 

Kringle domain 

MAC/Perforin domain 

Pentaxin family 

Serum amyloid A protein 

Sushi domain (SCR repeat) 

Thrombospondin N-terminal-like domains

Tissue factor 

Transglutaminase family 

Transglutaminase family 



regulators 

1 
2 

100(550) 
3 
1 
3 

14(16)
1 
2 

10(11) 
5 

7(8) 
12 
23 
9 
1 
14 
3 
1 
7 
10 

13(23) 
3 
4 
1 

3(5) 
1 
3 

5(9) 
5 

27(29) 
5(8) 
1 
2 
2 

17(31) 
27(28) 

3 

3 
18 



0 
0 

14(157) 
0 
0 
0 
0 
0 
1 
2 
2 
0 

2

2 

1 

7 

0 

2 

0 

0 

4

0 

0 

0 

0 

0 

0 

0 

2(4) 
0 
0 
0 
0 

1 

8(10) 
3 
0 
0 
0 
1 
1 
6 
0 
0 

7(10) 



0 
0 

16(66) 
0 
0 
0 
0 
0 
0 
0 
4 
0 
4 
1 
1 
3 
0 
1 
0 
0 
0 
0 
0 

1 

0 
0 
0 
0 

2(6) 
0 
0 
0 
0 
0 

3(4) 
0 
0 
0 
0 

1 

0 
4 
0 
0 
5 



0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



6(14) 


0 


0 


0 


24 


0 


0 


0 


18 




3 


0 


15(20) 


5(6) 


2 


0 


10 


0 


0 


0 


5(18)


0 


0 


0 


11(16) 


0 


0 


0 


15(24) 


2 


2 


0 


6 


0 


0 


0 


9 


0 


0 


0 


4 


0 


0 


0 


53 (191) 


11(42) 


8(45) 


0 


14 


1 


0 


0 


1 


0 


0 


0 


6 


1 


0 


0 


8 


1 


0 


0 



0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

Table 18 (Continued)



Accession 
number 



Domain name 



Gla



PF00711 
PF00748 
PF00666 
PF00129 

PF00993 
PF00969 
PF00879 
PF01109 
PF00047 
PF00143 
PF00714 
PF00726 
PF02372 
PF00715 
PF00727 
PF02025 
PF01415 
PF00340 
PF02394 
PF02059 
PF00489 
PF01291 

PF00323 
PF01091 
PF00277 
PF00048 




PF00779 
PF00168 
PF00609 
PF00781 
PF00610 

PF01363 
PF00996 
PF00503 
PF00631 
PF00616 
PF00618 

PF00625 
PF02189 
PF00169 
PF00130 

PF00388 

PF00387 

PF00640 
PF02192 
PF00794 
PF01412 
PF02196 
PF02145 
PF00788 
PF00071 

W 

PHJIt97 



Defensin_beta 
Calpain_inhib 
Cathelicidins 
MHC_I

MHC_II_alpha**
MHC_II_beta**
Defensin_propep
GM_CSF

ig

Interferon 
IFN-gamma 
IL10 
IL15 
IL2 
IL4 
IL5 
IL7 
IL1 

IL1_propep 
IL3 
IL6 

LIF_OSM

Defensins
PTN_MK
SAA_proteins
IL8 

TIR 
TNF 
Trefoil 

BTK 
C2 

DAGKa
DAGKc 
DEP 

FYVE 
GDI 

G-alpha
G-gamma 
RasGAP 
RasGEFN 

Guanylate_kin

ITAM 

PH 

DAG_PE-bind

PI-PLC-X 

PI-PLC-Y 



PID 

PI3K_p85B 
PI3K_rbd 
ArfGAP
RBD 

Rap_GAP

RA 

Ras 

RasGEF 

RGS 

RIIa



Domain description 

Vitamin K-dependent carboxylation/gamma 
carboxyglutamic (GLA) domain 

Immune response

Beta defensin 

Calpain inhibitor repeat 

Cathelicidins 

Class I histocompatibility antigen, domains alpha 1
and 2
Class II histocompatibility antigen, alpha domain 
Class II histocompatibility antigen, beta domain 
Defensin propeptide 

Granulocyte-macrophage colony-stimulating factor 

Immunoglobulin domain 

Interferon alpha/beta domain 

Interferon gamma 

Interleukin-10

Interleukin-15

Interleukin-2

Interleukin-4

Interleukin-5

Interleukin-7/9 family

Interleukin-1

Interleukin-1 propeptide

Interleukin-3

Interleukin-6/G-CSF/MGF family

Leukemia inhibitory factor (LIF)/oncostatin (OSM) 

family 
Mammalian defensin 
PTN/MK heparin-binding protein 
Serum amyloid A protein 
Small cytokines (intercrine/chemokine),

interleukin-8 like 
TIR domain 

TNF (tumor necrosis factor) family 
Trefoil (P-type) domain 

PI-PY-rho GTPase signaling
BTK motif
C2 domain 

Diacylglycerol kinase accessory domain (presumed) 
Diacylglycerol kinase catalytic domain (presumed) 
Domain found in Dishevelled, Egl-10, and 

Pleckstrin (DEP) 
FYVE zinc finger 
GDP dissociation inhibitor 
G-protein alpha subunit 
G-protein gamma like domains 
GTPase-activator protein for Ras-like GTPase 
Guanine nucleotide exchange factor for Ras-like 

GTPases; N-terminal motif 
Guanylate kinase 

Immunoreceptor tyrosine-based activation motif 
PH domain 

Phorbol esters/diacylglycerol binding domain (C1
domain) 

Phosphatidylinositol-specific phospholipase C, X 
domain 

Phosphatidylinositol-specific phospholipase C, Y 
domain 

Phosphotyrosine interaction domain (PTB/PID) 
PI3-kinase family, p85-binding domain
PI3-kinase family, ras-binding domain 
Putative GTP-ase activating protein for Arf 
Raf-like Ras-binding domain 
Rap/ran-GAP 

Ras association (RalGDS/AF-6) domain 
Ras family 
RasGEF domain 

Regulator of G protein signaling domain 
Regulatory subunit of type II PKA R-subunit 



11 



1 

3(9) 
2 

18(20) 

5(6) 
7 
3 
1 

381 (930)
7(9) 
1 
1 
1 
1 
1 
1 
1 
7 
1 
1 
2 
2 

2 
2 
4 
32 

18 
12 
5(6) 

5 

73(101) 
9 
10 
12(13) 

28(30) 
6 

27(30) 
16 
11 
9 

12 
3 

193(212) 
45(56) 

12 

11 

24(27) 
2 
6 
16 
6(7) 
5 

18(19) 
126 
21 
27 
4 



0 
0 
0 

0 

0 
0 
0 
0 

125 (291) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



w 



0 
0 
0 
0 

0 
0 
0 
0 

67(323) 
0 
0 
0 
0 
0 
0 
0 
0 

0
0 
0 
0 
0 



0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 



0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


8 


2 


0 


131 (143) 


0 


0 


b 


0 


0 


2 


0 


0 


1 


0 


0 


0 


32(44) 


24(35) 


6(9) 


66(90) 


4 


7 


0 


6 


8 


8 


2 


11(12) 


4 


10 


5 


2 


14 


15 


5 


15 


2 


1 


1 


3 


10 


20(23) 


2 


5 


5 


5 


1 


0 


5 


8 


3 


0 


2 


3 


5 


0 


8 


7 


1 


4 


0 


0 


0 


0 


72(78) 


65(68) 


24 


23 


25(31) 


26(40) 


1(2)


4 


3 


7 


1 


8 


2 


7 


1 


8 


13 


11(12) 


0 


0 


1 


1 


0 


0 


3 


1 


0 


0 


9 


8 


6 


15 


4 


1 


0 


0 


4 


2 


0 


0 


7(9) 


6 


1 


0 


56(57) 


51 


23 


78 


8 




5 


0 


6(7) 


12(13) 


1 


0 


1 


2 


1 


0 

Table 18 (Continued)



Accession 
number 



Domain name 






Domain description 



PF00620 
PF00621 
PF00536 
PF01369 
PF00017 
PF00018 
PF01017 
PF00790 
PF00568 



Pi 

k 



PF00452 
PF02180 
PF00619 
PF00531 
PF01335 
PF02179 
PF00656 
PF00653 



PF00022 

PF00191 

PF00402 

PF00373 

PF00880 

PF00681 

PF00435 

PF00418 

PF00992 

PF02209 

PF01044 

PF01391 
PF01413 

PF00431 
PF00008 
PF00147 

PF00041 
PF00757 
PF00357 
PF00362 
PF00052 
PF00053
PF00054 
PF00055 
PF00059 
PF01463 
PF01462 
PF00057 
PF00058 
PF00530 
PF00084 
PF00090 
PF00092 
PF00093 
PF00094 

PF00244 
PF00023 
PF00514 
PF00168 
PF00027 
PF01556
PF00226
PF00036 
PF00611 
PF01846 
PF00498 



RhoGAP
RhoGEF
SAM 
Sec7 
SH2 
SH3 
STAT 
VHS 
WH1 

Bcl-2 

BH4 

CARD 

Death 

DED 

BAG 

ICE_p20 

BIR 

Actin 

Annexin 

Calponin 

Band_41 

Nebulin_repeat 

Plectin_repeat 

Spectrin 

Tubulin-binding 

Troponin 

VHP 

Vinculin 

Collagen 
C4 

CUB 
EGF

Fibrinogen_C



Fn3 

Furin-like 
Integrin_A
Integrin_B
Laminin_B
Laminin_EGF 
Laminin_C 
Laminin_Nterm 
Lectin_c 
LRRCT 
LRRNT 
Ldl_recept_a
Ldl_recept_b 
SRCR 
Sushi 
Tsp_1
Vwa 
Vwc 
Vwd 

14-3-3 
Ank 

Armadillo_seg 
C2 

cNMP_binding 
DnaJ_C 
DnaJ 
Efhand** 
FCH 
FF 
FHA 



W 



RhoGAP domain
RhoGEF domain

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAT protein 
VHS domain 
WH1 domain 

Domains involved in apoptosis
Bcl-2

Bcl-2 homology region 4 
Caspase recruitment domain 
Death domain 
Death effector domain 
Domain present in Hsp70 regulators 
ICE-like protease (caspase) p20 domain 
Inhibitor of Apoptosis domain 

Cytoskeletal

Actin 

Annexin 

Calponin family 

FERM domain (Band 4.1 family) 
Nebulin repeat 
Plectin repeat 
Spectrin repeat 

Tau and MAP proteins, tubulin-binding 
Troponin 

Villin headpiece domain 
Vinculin family 

ECM adhesion

Collagen triple helix repeat (20 copies) 
C-terminal tandem repeated domain in type 4 

procollagen 
CUB domain 
EGF-like domain

Fibrinogen beta and gamma chains, C-terminal 

globular domain 
Fibronectin type III domain
Furin-like cysteine rich region 
Integrin alpha cytoplasmic region 
Integrins, beta chain 
Laminin B (Domain IV) 
Laminin EGF-like (Domains III and V)
Laminin G domain 
Laminin N-terminal (Domain VI) 
Lectin C-type domain 
Leucine rich repeat C-terminal domain 
Leucine rich repeat N-terminal domain 
Low-density lipoprotein receptor domain class A 
Low-density lipoprotein receptor repeat class B 
Scavenger receptor cysteine-rich domain 
Sushi domain (SCR repeat) 
Thrombospondin type 1 domain 
von Willebrand factor type A domain 
von Willebrand factor type C domain 
von Willebrand factor type D domain 

Protein interaction domains

14-3-3 proteins 
Ank repeat 

Armadillo/beta-catenin-like repeats 
C2 domain

Cyclic nucleotide-binding domain 
DnaJ C terminal region 
DnaJ domain 
EF hand 

Fes/CIP4 homology domain 
FF domain 
FHA domain 



59 
46 
29 (31) 
13 

87(95) 
143(182) 
7 
4 
7 

9 
3 
16 
16 
4(5) 
5(8) 
11 
8(14) 

61(64) 
16(55) 
13(22) 
29 (30) 
4(148) 
2(11) 
31 (195) 
4(12) 
4 
5 
4 



19 

23(24) 
15 
5 

33(39) 
55(75) 
1 
2 
2 

2 
0 
0 
5 
0 
3 
7 

5(9) 

15(16) 
4(16) 
3 

17(19) 
1(2) 
0 

13(171)
1(4) 
6 
2 
2 



65 (279) 
6(11) 

47(69) 
108(420) 
26 

106 (545) 
5 
3 
8 

8(12) 
24(126) 
30(57) 
10 
47(76) 
69 (81) 
40(44) 
35(127) 
15(96) 
11(46) 
53 (191) 
41 (66) 
34(58) 
19(28) 
15(35) 



20 

145 (404) 
22 (56) 
73(101) 
26(31) 
12 
44 

83(151) 
9 

4(11) 
13 



10(46) 
2(4) 

9(47) 
45 (186) 
10(11) 

42 (168) 
2 
1 
2 

4(7) 
9(62) 
18(42) 
6 

23(24) 
23 (30) 
7(13) 
33 (152) 
9(56) 
4(8) 
11(42) 
11(23) 
0 

6(11) 
3(7) 

3 

72 (269) 
11(38) 
32 (44) 
21 (33) 
9 
34 

64(117) 
3 

4(10) 
15 



20 
18(19) 
8 
5 

44(48) 
46(61) 

1(2) 
4 

2(3) 



34(156) 
1 
2 
2 

6(10) 
11(65) 
14(26) 
4 

91 (132) 
7(9) 
3(6) 
27(113) 
7(22) 
1(2) 
8(45) 
18(47) 
17(19) 

2(5) 
9 

3 

75 (223) 
3(11) 
24(35) 
15(20) 
5 
33 
41 (86) 
2 

3(16) 
7 



9 
3 
3 
5 
1 

23(27) 
0 
4 
1 



0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



8 
0 
6 
9 
3 
4 
0 
8 
0 



1 


0 


0 


1 


0 


n 


2 


0 


o 


7 


0 


0 


0 


0 


0 


2 


1 


5 


3 


0 


0 


2(3) 


K2) 


0 


12 


9(11) 


24 


4(11) 


0 


6(16) 


7(19) 


0 


0 


11(14) 


0 


0 


1 


0 . 


. 0 


0 


0 


0 


10(93) 


0 


0 


2(8) 


0 


0 


8 


0 


0 


2 


0 


5 


1 


0 


0 


174(384) 


0 


0 


3(6) 


0 


0 


43(67) 


0 


0 


54(157) 


0 


1 


6 


0 


0 



1 

0 
0 
0 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

1 

0 
0 



2 

12(20) 
2(10) 
6(9) 
2(3) 
3 
20 
4(11) 
4 

2(5) 
13(14) 



15 

66(111) 
25(67) 
66 (90) 
22 
19 
93 

120 (328) 
0 

4(8) 
17 
myelin proteins result in severe demyelination,
which is a pathological condition in
which the myelin is lost and the nerve conduction
is severely impaired (130). Humans
have at least 10 genes belonging to four
different families involved in myelin production
(five myelin P0, three myelin proteolipid,
myelin basic protein, and myelin-oligodendrocyte
glycoprotein, or MOG), and possibly
more-remotely related members of the
MOG family. Flies have only a single myelin
proteolipid, and worms have none at all.

Table 18 (Continued)

The Human Genome



Intercellular and intracellular signaling 
pathways in development and homeostasis. 
Many protein families that have expanded in 
humans relative to the invertebrates are in- 
volved in signaling processes, particularly in 
response to development and differentiation 



Accession 
number



Domain name 



Domain description 



H 



W 



PF00254 

PF01590 

PF01344 

PF00560 

PF00917 

PF00989 

PF00595 

PF00169 

PF01535 

PF00536 

PF01369 

PF00017 

PF00018 

PF01740

PF00515 

PF00400 

PF00397 

PF00569 

PF01754 
PF01388 
PF01426 
PF00643 
PF00533 

PF00385 

PF00125 

PF00134 

PF00270 

PF01529 

PF00646 

PF00250 

PF00320 

PF01585 

PF00010 

PF00850 

PF00046 

PF01833 

PF02373 

PF02375 

PF00013 

PF01352 

PF00104 

PF00412 
PF00917 
PF00249 
PF02344 
PF01753 
PF00628 
PF00157 
PF02257 
PF00076 

PF02037 
Pi 



FKBP 
CAF 
Kelch 
LRR** 
MATH 
PAS 
PDZ 
PH 

PPR** 

SAM 

Sec7 

SH2 

SH3 

STAS 

TPR** 

WD40** 

WW 

ZZ 

Zf-A20 

ARID 

BAH 

Zf-B_box** 
BRCT 

Bromodomain 
BTB 

DNA_methylase 
Chromo 

Histone 

Cyclin 

DEAD 

Zf-DHHC 

F-box** 

Fork_head

GATA

G-patch 

HLH** 

Hist_deacetyl

Homeobox 

TIG

JmjC 

JmjN 

KH-domain 
KRAB 

Hormone_rec

LIM
MATH 

Myb_DNA-binding

Myc-LZ 

Zf-MYND 

PHD 

Pou 

RFX_DNA_binding
Rrm 

SAP 
SPRY 
START 
T-box 



FKBP-type peptidyl-prolyl cis-trans isomerases
CAF domain 
Kelch motif 
Leucine Rich Repeat 
MATH domain 
PAS domain 

PDZ domain (Also known as DHR or CLGF) 
PH domain 
PPR repeat 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAS domain 
TPR domain 
WD40 domain 
WW domain 

ZZ-Zinc finger present in dystrophin, CBP/p300

Nuclear interaction domains 

A20-like zinc finger 
ARID DNA binding domain 
BAH domain 
B-box zinc finger 

BRCA1 C Terminus (BRCT) domain 
Bromodomain 
BTB/POZ domain 

C-5 cytosine-specific DNA methylase 
'chromo' (CHRromatin Organization Modifier) domain

Core histone H2A/H2B/H3/H4 
Cyclin 

DEAD/DEAH box helicase 
DHHC zinc finger domain 
F-box domain 
Fork head domain 
GATA zinc finger
G-patch domain 

Helix-loop-helix DNA-binding domain 
Histone deacetylase family 
Homeobox domain 
IPT/TIG domain
JmjC domain 
JmjN domain 
KH domain 
KRAB box 

Ligand-binding domain of nuclear hormone receptor
LIM domain containing proteins 
MATH domain 

Myb-like DNA-binding domain 
Myc leucine zipper domain 
MYND finger 
PHD-finger

Pou domain — N-terminal to homeobox domain 
RFX DNA-binding domain 
RNA recognition motif (a.k.a. RRM, RBD, or RNP domain)
SAP domain 
SPRY domain 
START domain 
T-box 



15(20) 
7(8) 
54(157) 
25(30) 
11 

18(19) 
96(154) 
193(212) 
5 

29(31) 
13 

87(95) 
143 (182) 
5 

72(131) 
136 (305) 
32(53) 
10(11) 

2(8) 
11 
8(10) 
32(35) 
17(28) 
37(48) 
97(98) 
3(4) 
24(27) 

75(81) 
19 
63 (66) 
15 
16 
35(36) 
11(17) 
18 

60(61) 
12 

160 (178) 
29(53) 
10 
7 

28(67) 
204(243) 
47 

62 (129) 
11 

32(43) 
1 
14 
68(86) 
15 
7 

224(324) 
15 

44(51) 
10 
17(19) 



7(8) 
2(4) 
12(48) 
24(30) 
5 

9(10) 
60(87) 
72(78) 
3(4) 
15 
5 

33(39) 
55(75) 
1 

39(101) 
98(226) 
24(39) 
13 

2 
6 

7(8) 
1 

10(18) 
16(22) 
62 (64) 
1 

14(15) 

5 
10 
48(50) 
20 
15 

20(21) 
5(6) 
16 
44 
5(6) 
100(103) 
11(13) 
4 
4 

14(32) 
0 
17 

33(83) 
5 

18(24) 
0 
14 

40(53) 
5 
2 

127(199) 
8 

10(12) 
2 
8 



7(13) 
1 

13(41) 
7(11) 
88(161) 
6 

46(66) 
65 (68) 
0 
8 
5 

44(48) 
46(61) 
6
28(54) 



4 
0 
3 
1 
1 
1 
2 
24 
1 
3 
5 
1 

23(27) 
2 

16(31) 



72(153) 56(121) 
16(24) 5(8) 
10 2 



2 
4 

4(5) 

23(35) 
18(26) 
86(91) 
0 

17(18) 

71(73) 
10 

55(57) 
16 

309 (324) 
15 
8(10) 
13 
24 
8(10) 
82 (84) 
5(7) 
6 
2 

17(46) 
0 

142(147) 

33(79) 
88 (161) 
17(24) 
0 
9 

32(44) 
4 
1 

94(145) 
5 

5(7) 
6 
22 



0 
2 
5 
0 

10(16) 
10(15) 
1(2) 
0 

1(2) 

8 
11 

50(52) 
7 
9 
4 
9 
4 
4 
5 
6 
2 
4 
3 

4(14) 
0 
0 

4(7) 
1 

15(20) 
0 
1 

14(15) 
0 
1 

43(73) 



24(29) 
10 

102 (178) 
15(16) 
61 (74) 
13(18) 
5 
23 

474(2485) 
6 
9 
3 
4 
13 

65(124) 
167(344) 
11(15) 
10 

8
7 

21 (25) 
0 

12(16) 
28 
30(31) 
13(15) 
12 

48 
35 
84(87) 
22 

165(167) 
0 
26
14(15) 
39 
10 
66 
1 
7 
7 

27(61) 
0 
0 

10(16) 
61 (74) 
243 (401) 
0 
7 

96 (105) 
0 
0 

232 (369) 



5 6(7) 

3 6 

0 23 

0 0 
Table 18 (Continued)






Accession number  Domain name  Domain description                                         H          F         W        Y

PF02135           Zf-TAZ       TAZ finger                                                 T(3)       1(2)      6(7)     0
PF01285           TEA          TEA domain                                                 4          1         1        1
PF02176           Zf-TRAF      TRAF-type zinc finger                                      6(9)       1(3)      1        0
PF00352           TBP          Transcription factor TFIID (or TATA-binding protein, TBP)  2(4)       4(8)      2(4)     1(2)
PF00567           TUDOR        TUDOR domain                                               9(24)      9(19)     4(5)     0
PF00642           Zf-CCCH      Zinc finger C-x8-C-x5-C-x3-H type (and similar)            17(22)     6(8)      22(42)   3(5)
PF00096           Zf-C2H2**    Zinc finger, C2H2 type                                     564(4500)  234(771)  68(155)  34(56)
PF00097           Zf-C3HC4     Zinc finger, C3HC4 type (RING finger)                      135(137)   57        88(89)   18
PF00098           Zf-CCHC      Zinc knuckle                                               9(17)      6(10)     17(33)   7(13)

Final column (header not legible; only eight of nine values survive, so row alignment is uncertain):
10(15), C, 2(4), 2, 31(46), 21(24), 298(304), 68(91)



(Tables 18 and 19). They include secreted
hormones and growth factors, receptors, in- 
tracellular signaling molecules, and transcrip- 
tion factors. 

Developmental signaling molecules that are
enriched in the human genome include growth
factors such as wnt, transforming growth
factor-β (TGF-β), fibroblast growth factor (FGF),
nerve growth factor, platelet derived growth
factor (PDGF), and ephrins. These growth factors
affect tissue differentiation and a wide
range of cellular processes involving actin-cytoskeletal
and nuclear regulation. The corresponding
receptors of these developmental ligands
are also expanded in humans. For example,
our analysis suggests at least 8 human
ephrin genes (2 in the fly, 4 in the worm) and 12
ephrin receptors (2 in the fly, 1 in the worm). In
the wnt signaling pathway, we find 18 wnt
family genes (6 in the fly, 5 in the worm) and
12 frizzled receptors (6 in the fly, 5 in the
worm). The Groucho family of transcriptional
corepressors downstream in the wnt pathway
is even more markedly expanded, with 13
predicted members in humans (2 in the fly, 1 in
the worm).
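
The counts quoted above can be turned into fold-expansion figures with simple arithmetic. The following Python sketch is illustrative only and is not part of the original analysis; the names `family_counts` and `fold_expansion` are hypothetical, and the human ephrin figure is a lower bound because the text reports "at least 8" genes.

```python
# Gene counts quoted in the paragraph above, as (human, fly, worm).
# The human ephrin count is a lower bound ("at least 8").
family_counts = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def fold_expansion(human, other):
    """Return the human count divided by the count in another genome."""
    return human / other


for name, (human, fly, worm) in family_counts.items():
    print(f"{name:16s} {fold_expansion(human, fly):5.1f}x vs fly "
          f"{fold_expansion(human, worm):5.1f}x vs worm")
```

Expressed this way, the Groucho family stands out as the largest relative expansion among the families listed.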

Extracellular adhesion molecules involved 
in signaling are expanded in the human genome 
(Tables 18 and 19). The interactions of several 
of these adhesion domains with extracellular 
matrix proteoglycans play a critical role in host 
defense, morphogenesis, and tissue repair 
(131). Consistent with the well-defined role of
heparan sulfate proteoglycans in modulating
these interactions (132), we observe an expansion
of the heparin sulfate sulfotransferases in
the human genome relative to worm and fly.
These sulfotransferases modulate tissue
differentiation (133). A similar expansion in humans
is noted in structural proteins that constitute the 
actin-cytoskeletal architecture. Compared with 
the fly and worm, we observe an explosive 
expansion of the nebulin (35 domains per pro- 
tein on average), aggrecan (12 domains per 
protein on average), and plectin (5 domains per 
protein on average) repeats in humans. These 
repeats are present in proteins involved in mod- 
ulating the actin-cytoskeleton with predominant 
expression in neuronal, muscle, and vascular 
tissues. 



Comparison across the five sequenced eukaryotic
organisms revealed several expanded
protein families and domains involved in
cytoplasmic signal transduction (Table 18).
In particular, signal transduction pathways
playing roles in developmental regulation and
acquired immunity were substantially enriched.
There is a factor of 2 or greater expansion
in humans in the Ras superfamily
GTPases and the GTPase activator and GTP
exchange factors associated with them. Although
there are about the same number of
tyrosine kinases in the human and C. elegans
genomes, in humans there is an increase in
the SH2, PTB, and ITAM domains involved
in phosphotyrosine signal transduction. Further,
there is a twofold expansion of phosphodiesterases
in the human genome compared
with either the worm or fly genomes.
The downstream effectors of the intracellular
signaling molecules include the transcription
factors that transduce developmental fates. Significant
expansions are noted in the ligand-binding
nuclear hormone receptor class of transcription
factors compared with the fly genome,
although not to the extent observed in the worm
(Tables 18 and 19). Perhaps the most striking
expansion in humans is in the C2H2 zinc finger 
transcription factors. Pfam detects a total of 
4500 C2H2 zinc finger domains in 564 human 
proteins, compared with 771 in 234 fly proteins. 
This means that there has been a dramatic 
expansion not only in the number of C2H2 
transcription factors, but also in the number of 
these DNA-binding motifs per transcription 
factor (8 on average in humans, 3.3 on average
in the fly, and 2.3 on average in the worm). 
Furthermore, many of these transcription factors
contain either the KRAB or SCAN domains,
which are not found in the fly or worm
genomes. These domains are involved in the 
oligomerization of transcription factors and in- 
crease the combinatorial partnering of these 
factors. In general, most of the transcription 
factor domains are shared between the three 
animal genomes, but the reassortment of these 
domains results in organism-specific transcrip- 
tion factor families. The domain combinations 
found in the human, fly, and worm include the 
BTB with C2H2 in the fly and humans, and
homeodomains alone or in combination with
Pou and LIM domains in all of the animal
genomes. In plants, however, a different set of
transcription factors is expanded, namely, the
myb family, and a unique set that includes VP1
and AP2 domain-containing proteins (134).
The yeast genome has a paucity of transcription
factors compared with the multicellular eukaryotes,
and its repertoire is limited to the
expansion of the yeast-specific C6 transcription
factor family involved in metabolic regulation.
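
The per-protein averages quoted for the C2H2 zinc fingers follow directly from the Pfam counts. A minimal Python sketch, assuming the human and fly counts given in the text and the worm counts (155 domains in 68 proteins) as they appear in the Zf-C2H2 row of Table 18; the variable and function names are hypothetical.

```python
# (proteins, domains) for C2H2 zinc fingers; hypothetical variable name.
c2h2_counts = {
    "human": (564, 4500),
    "fly": (234, 771),
    "worm": (68, 155),  # taken from Table 18, not from the running text
}


def mean_domains_per_protein(proteins, domains):
    """Average number of domains carried by each domain-containing protein."""
    return domains / proteins


for organism, (proteins, domains) in c2h2_counts.items():
    print(f"{organism:6s} {mean_domains_per_protein(proteins, domains):.1f}")
```

Rounded to one decimal place, these ratios match the averages of 8, 3.3, and 2.3 stated above.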
While we have illustrated expansions in a 
subset of signal transduction molecules in the 
human genome compared with the other eu- 
karyotic genomes, it should be noted that 
most of the protein domains are highly conserved.
An interesting observation is that
worms and humans have approximately the
same number of both tyrosine kinases and
serine/threonine kinases (Table 19). It is important
to note, however, that these are merely
counts of the catalytic domain; the proteins
that contain these domains also display a
wide repertoire of interaction domains with
significant combinatorial diversity.
Hemostasis. Hemostasis is regulated primarily
by plasma proteases of the coagulation
pathway and by the interactions that occur be- 
tween the vascular endothelium and platelets. 
Consistent with known anatomical and physio- 
logical differences between vertebrates and in- 
vertebrates, extracellular adhesion domains that 
constitute proteins integral to hemostasis are 
expanded in the human relative to the fly and 
worm (Tables 18 and 19). We note the evolu- 
tion of domains such as FIMAC, FN1, FN2, 
and C1q that mediate surface interactions between
hematopoietic cells and the vascular matrix.
In addition, there has been extensive
recruitment of more-ancient animal-specific
domains such as VWA, VWC, VWD, kringle,
and FN3 into multidomain proteins that are 
involved in hemostatic regulation. Although we 
do not find a large expansion in the total num- 
ber of serine proteases, this enzymatic domain 
has been specifically recruited into several of 
these multidomain proteins for proteolytic regulation
in the vascular compartment. These are
represented in plasma proteins that belong to 
the kinin and complement pathways. There is a 
significant expansion in two families of matrix
metalloproteases: ADAM (a disintegrin and
metalloprotease) and MMPs (matrix
metalloproteases) (Table 19). Proteolysis of extracellular
matrix (ECM) proteins is critical for tissue
development and for tissue degradation in diseases
such as cancer, arthritis, Alzheimer's disease,
and a variety of inflammatory conditions
(135, 136). ADAMs are a family of integral
membrane proteins with a pivotal role in fibrin- 
ogenolysis and modulating interactions be- 
tween hematopoietic components and the 
vascular matrix components. These proteins 
have been shown to cleave matrix proteins, 
and even signaling molecules: ADAM-17
converts tumor necrosis factor-α, and
ADAM-10 has been implicated in the Notch
signaling pathway (135). We have identified 
19 members of the matrix metalloprotease 
family, and a total of 51 members of the 
ADAM and ADAM-TS families. 

Apoptosis. Evolutionary conservation of 
some of the apoptotic pathway components 
across eukarya is consistent with its central 
role in developmental regulation and as a 
response to pathogens and stress signals. The 
signal transduction pathways involved in programmed
cell death, or apoptosis, are mediated
by interactions between well-characterized
domains that include extracellular domains,
adaptor (protein-protein interaction)
domains, and those found in effector and
regulatory enzymes (137). We enumerated
the protein counts of central adaptor and effector
enzyme domains that are found only in
the apoptotic pathways to provide an estimate 
of divergence across eukarya and relative 
expansion in the human genome when com- 
pared with the fly and worm (Table 18). 
Adaptor domains found in proteins restricted 
only to apoptotic regulation such as the DED 
domains are vertebrate-specific, whereas oth- 
ers like BIR, CARD, and Bcl2 are represent- 
ed in the fly and worm (although the number 
of Bcl2 family members in humans is signif- 
icantly expanded). Although plants and yeast 
lack the caspases, caspase-like molecules, 
namely the para- and meta-caspases, have 
been reported in these organisms (138). Com- 
pared with other animal genomes, the human 
genome shows an expansion in the adaptor 
and effector domain-containing proteins in- 
volved in apoptosis, as well as in the pro- 
teases involved in the cascade such as the 
caspase and calpain families. 

Expansions of other protein families. 
Metabolic enzymes. There are fewer cyto- 
chrome P450 genes in humans than in either 
the fly or worm. Lipoxygenases (six in hu- 
mans), on the other hand, appear to be specific 
to the vertebrates and plants, whereas the
lipoxygenase-activating proteins (four in humans)
are vertebrate-specific. Lipoxygenases are
involved in arachidonic acid metabolism, and
they and their activators have been implicated 
in diverse human pathology ranging from
allergic responses to cancers. One of the most 
surprising human expansions, however, is in 
the number of glyceraldehyde-3-phosphate
dehydrogenase (GAPDH) genes (46 in humans,
3 in the fly, and 4 in the worm). There
is, however, evidence for many retrotransposed
GAPDH pseudogenes (139), which
may account for this apparent expansion. 
However, it is interesting that GAPDH, long 
known as a conserved enzyme involved in 
basic metabolism found across all phyla from 
bacteria to humans, has recently been shown 
to have other functions. It has a second cat- 
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alytic activity, as a uracil DNA glycosylase 

(140) and functions as a cell cycle regulator < 

(14 1) and has even been implicated in apo- . 
ptosis (142). 

Translation. Another striking set of hu- 
man expansions has occurred in certain fam- 
ilies involved in the translational machinery. 
We identified 28 different ribosomal subunits 
that each have at least 10 copies in the genome;
on average, for all ribosomal proteins
there is about an 8- to 10-fold expansion in 
the number of genes relative to either the 
worm or fly. Retrotransposed pseudogenes 


Table 19 (Continued) 
may account for many of these expansions
[see the discussion above and (143)]. Recent 
evidence suggests that a number of ribosomal 
proteins have secondary functions independent
of their involvement in protein biosynthesis;
for example, L13a and the related L7
subunits (36 copies in humans) have been
shown to induce apoptosis (144). 

There is also a four- to fivefold expansion
in the elongation factor 1-alpha family
(eEF1A; 56 human genes). Many of these
expansions likely represent intronless paralogs
that have presumably arisen from
retrotransposition, and again there is evidence that
many of these may be pseudogenes (145).
However, a second form (eEF1A2) of this
factor has been identified with tissue-specific
expression in skeletal muscle and a complementary
expression pattern to the ubiquitously
expressed eEF1A (146).

Ribonucleoproteins. Alternative splicing
results in multiple transcripts from a single
gene, and can therefore generate additional
diversity in an organism's protein complement.
We have identified 269 genes for
ribonucleoproteins. This represents over 2.5
times the number of ribonucleoprotein genes
in the worm, two times that of the fly, and
about the same as the 265 identified in the
Arabidopsis genome. Whether the diversity
of ribonucleoprotein genes in humans contributes
to gene regulation at either the splicing
or translational level is unknown.

Posttranslational modifications. In this 
set of processes, the most prominent expan- 
sion is the transglutaminases, calcium-depen- 
dent enzymes that catalyze the cross-linking 
of proteins in cellular processes such as he- 
mostasis and apoptosis (147). The vitamin 
K-dependent gamma carboxylase gene product
acts on the GLA domain (missing in the
fly and worm) found in coagulation factors, 
osteocalcin, and matrix GLA protein (148). 
Tyrosylprotein sulfotransferases participate 
in the posttranslational modification of pro- 
teins involved in inflammation and hemosta- 
sis, including coagulation factors and chemo- 
kine receptors (149). Although there is no 
significant numerical increase in the counts 
for domains involved in nuclear protein mod- 
ification, there are a number of domain ar- 
rangements in the predicted human proteins 
that are not found in the other currently se- 
quenced genomes. These include the tandem 
association of two histone deacetylase do- 
mains in HD6 with a ubiquitin finger domain, 
a feature lacking in the fly genome. An ad- 
ditional example is the co-occurrence of im- 
portant nuclear regulatory enzyme PARP 
(poly-ADP ribosyl transferase) domain fused 
to protein-interaction domains— BRCT and 
VWA in humans. 

Concluding remarks. There are several 
possible explanations for the differences in 
phenotypic complexity observed in humans 
when compared to the fly and worm. Some of 
these relate to the prominent differences in
the immune system, hemostasis, neuronal, 
vascular, and cytoskeletal complexity. The 
finding that the human genome contains few- 
er genes than previously predicted might be 
compensated for by combinatorial diversity 
generated at the levels of protein architecture, 
transcriptional and translational control, post- 
translational modification of proteins, or 
posttranscriptional regulation. Extensive do- 
main shuffling to increase or alter combina- 
torial diversity can provide an exponential 
increase in the ability to mediate protein-protein
interactions without dramatically increasing
the absolute size of the protein complement
(150). Evolution of apparently new
(from the perspective of sequence analysis)
protein domains and increasing regulatory
complexity by domain accretion both quanti- 
tatively and qualitatively (recruitment of nov- 
el domains with preexisting ones) are two 
features that we observe in humans. Perhaps 
the best illustration of this trend is the C2H2 
zinc finger-containing transcription factors, 
where we see expansion in the number of 
domains per protein, together with verte- 
brate-specific domains such as KRAB and 
SCAN. Recent reports on the prominent use 
of internal ribosomal entry sites in the human 
genome to regulate translation of specific 
classes of proteins suggest that this is an area
that needs further research to identify the full
extent of this process in the human genome
(151). At the posttranslational level, although
we provide examples of expansions of some 
protein families involved in these modifica- 
tions, further experimental evidence is re- 
quired to evaluate whether this is correlated 
with increased complexity in protein process- 
ing. Posttranscriptional processing and the 
extent of isoform generation in the human 
remain to be cataloged in their entirety. Given 
the conserved nature of the spliceosomal ma- 
chinery, further analysis will be required to 
dissect regulation at this level. 
8.1 The whole-genome sequencing
approach versus BAC by BAC

Experience in applying the whole-genome 
shotgun sequencing approach to a diverse 
group of organisms with a wide range of 
genome sizes and repeat content allows us to 
assess its strengths and weaknesses. With the 
success of the method for a large number of 
microbial genomes, Drosophila, and now the 
human, there can be no doubt concerning the 
utility of this method. The large number of 
microbial genomes that have been sequenced 
by this method (15, 80, 152) demonstrates that
megabase-sized genomes can be sequenced
efficiently without any input other than the de
novo mate-paired sequences. With more 
complex genomes like those of Drosophila or 
human, map information, in the form of well- 
ordered markers, has been critical for long- 
range ordering of scaffolds. For joining scaf- 
folds into chromosomes, the quality of the 
map (in terms of the order of the markers) is 
more important than the number of markers
per se. Although this mapping could have
been performed concurrently with sequencing,
the prior existence of mapping data was
beneficial. During the sequencing of the A.
thaliana genome, sequencing of individual
BAC clones permitted extension of the sequence
well into centromeric regions and allowed
high-quality resolution of complex repeat
regions. Likewise, in Drosophila, the
BAC physical map was most useful in re- 
gions near the highly repetitive centromeres 
and telomeres. WGA has been found to de- 
liver excellent-quality reconstructions of the 
unique regions of the genome. As the genome 
size, and more importantly the repetitive con- 
tent, increases, the WGA approach delivers 
less of the repetitive sequence. 

The cost and overall efficiency of clone-by- 
clone approaches makes them difficult to justify 
as a stand-alone strategy for future large-scale 
genome-sequencing projects. Specific applica- 
tions of BAC-based or other clone mapping and 
sequencing strategies to resolve ambiguities in
sequence assembly that cannot be efficiently 
resolved with computational approaches alone 
are clearly worth exploring. Hybrid approaches 
to whole-genome sequencing will only work if 
there is sufficient coverage in both the whole-genome
shotgun phase and the BAC clone sequencing
phase. Our experience with human
genome assembly suggests that this will require
at least 3X coverage of both whole-genome and
BAC shotgun sequence data. 
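
One way to see why coverage of this order matters is the idealized Lander-Waterman (Poisson) model, under which the expected fraction of bases sampled at least once at coverage c is 1 - e^(-c). This is a textbook simplification that ignores repeats and cloning bias, and it is offered here as an illustration rather than as the reasoning behind the 3X figure above. A Python sketch with a hypothetical function name:

```python
import math


def expected_fraction_covered(coverage):
    """Idealized fraction of bases sampled at least once at a given coverage."""
    return 1.0 - math.exp(-coverage)


for c in (1, 3, 6):
    print(f"{c}X coverage: {expected_fraction_covered(c):.3f}")
```

Under this idealized model, 3X coverage leaves roughly 5% of bases unsampled in a given data set, which suggests why lower coverage in either phase would leave many gaps for the other phase to close.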

8.2 The low gene number in humans 

We have sequenced and assembled ~95% of 
the euchromatic sequence of H. sapiens and
used a new automated gene prediction method
to produce a preliminary catalog of the
human genes. This has provided a major surprise:
We have found far fewer genes (26,000
to 38,000) than the earlier molecular pre- 
dictions (50,000 to over 140,000). Whatever 
the reasons for this current disparity, only 
detailed annotation, comparative genomics 
(particularly using the Mus musculus ge- 
nome), and careful molecular dissection of 
complex phenotypes will clarify this critical 
issue of the basic "parts list" of our genome. 
Certainly, the analysis is still incomplete and 
considerable refinement will occur in the 
years to come as the precise structure of each 
transcription unit is evaluated. A good place 
to start is to determine why the gene esti- 
mates derived from EST data are so discor- 
dant with our predictions. It is likely that the 
following contribute to an inflated gene num- 
ber derived from ESTs: the variable lengths 
of 3'- and 5'-untranslated leaders and trailers;
the little-understood vagaries of RNA pro- 
cessing that often leave intronic regions in an 
unspliced condition; the finding that nearly 
40% of human genes are alternatively spliced 
(153); and finally, the unsolved technical 
problems in EST library construction where 
contamination from heterogeneous nuclear 
RNA and genomic DNA are not uncommon. 
Of course, it is possible that there are genes 
that remain unpredicted owing to the absence 
of EST or protein data to support them, al- 
though our use of mouse genome data for 



predicting genes should limit this number. As 
was true at the beginning of genome sequenc- 
ing, ultimately it will be necessary to measure 
mRNA in specific cell types to demonstrate
the presence of a gene. 

J. B. S. Haldane speculated in 1937 that a 
population of organisms might have to pay a
price for the number of genes it can possibly
carry. He theorized that when the number of
genes becomes too large, each zygote carries
so many new deleterious mutations that the
population simply cannot maintain itself. On
the basis of this premise, and on the basis of
available mutation rates and x-ray-induced
mutations at specific loci, Muller, in 1967
(154), calculated that the mammalian genome
would contain a maximum of not much
more than 30,000 genes (155). An estimate of
30,000 gene loci for humans was also arrived
at by Crow and Kimura (156). Muller's esti- 
mate for D. melanogaster was 10,000 genes, 
compared to 13,000 derived by annotation of 
the fly genome (26, 27). These arguments for 
the theoretical maximum gene number were 
based on simplified ideas of genetic load — 
that all genes have a certain low rate of 
mutation to a deleterious state. However, it is 
clear that many mouse, fly, worm, and yeast 
knockout mutations lead to almost no dis- 
cernible phenotypic perturbations. 
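
The load argument can be sketched numerically. Under a standard simplified model (independent, multiplicative effects, the same deleterious mutation rate at every gene, and mutations with some effect in heterozygotes), mean fitness at mutation-selection balance is roughly e^(-U), where U = 2nu is the expected number of new deleterious mutations per diploid zygote for n genes with per-gene rate u. The per-gene rate used below (1e-5 per generation) is an illustrative assumption, not the value used by Muller or by Crow and Kimura, and the function name is hypothetical.

```python
import math


def mean_fitness(n_genes, u_per_gene):
    """Approximate mean fitness exp(-U), with U = 2 * n_genes * u_per_gene."""
    genomic_rate = 2 * n_genes * u_per_gene  # new deleterious mutations per zygote
    return math.exp(-genomic_rate)


ASSUMED_RATE = 1e-5  # illustrative per-gene rate per generation, not from the source

for n_genes in (30_000, 50_000, 140_000):
    print(f"{n_genes:>7,d} genes: mean fitness ~ {mean_fitness(n_genes, ASSUMED_RATE):.3f}")
```

With these assumed inputs the predicted fitness falls steeply as the gene count rises, which is the qualitative point of the argument; the knockout observations above are one reason the uniform-rate assumption is too simple.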

The modest number of human genes
means that we must look elsewhere for the
mechanisms that generate the complexities
inherent in human development and the sophisticated
signaling systems that maintain
homeostasis. There are a large number of 
ways in which the functions of individual 
genes and gene products are regulated. The 
degree of "openness" of chromatin structure
and hence transcriptional activity is regulated
by protein complexes that involve histone
and DNA enzymatic modifications. We enu- 
merate many of the proteins that are likely 
involved in nuclear regulation in Table 19. 
The location, timing, and quantity of transcription
are intimately linked to nuclear signal
transduction events as well as to the
tissue-specific expression of many of these 
proteins. Equally important are regulatory 
DNA elements that include insulators, re- 
peats, and endogenous viruses (157); meth- 
ylation of CpG islands in imprinting (158); 
and promoter-enhancer and intronic regions 
that modulate transcription. The spliceosomal 
machinery consists of multisubunit proteins 
(Table 19) as well as structural and catalytic 
RNA elements (159) that regulate transcript 
structure through alternative start and termi- 
nation sites and splicing. Hence, there is a 
need to study different classes of RNA mol- 
ecules (160) such as small nucleolar RNAs, 
antisense riboregulator RNA, RNA involved 
in X-dosage compensation, and other struc- 
tural RNAs to appreciate their precise role in 
regulating gene expression. The phenomenon 



of RNA editing in which coding changes
occur directly at the level of mRNA is of 
clinical and biological relevance (161). Final- 
ly, examples of translational control include 
internal ribosomal entry sites that are found
in proteins involved in cell cycle regulation
and apoptosis (162). At the protein level,
minor alterations in the nature of protein-protein
interactions, protein modifications,
and localization can have dramatic effects on 
cellular physiology (163). This dynamic sys- 
tem therefore has many ways to modulate 
activity, which suggests that definition of 
complex systems by analysis of single genes 
is unlikely to be entirely successful. 

In situ studies have shown that the human 
genome is asymmetrically populated with
G+C content, CpG islands, and genes (68).
However, the genes are not distributed quite
as unequally as had been predicted (Table 9)
(69). The most G+C-rich fraction of the genome,
the H3 isochores, constitutes more of the
genome than previously thought (about 9%)
and is the most gene-dense fraction, but
contains only 25% of the genes, rather than the
predicted ~40%. The low G+C L isochores
make up 65% of the genome, and 48% of the 
genes. This inhomogeneity, the net result of 
millions of years of mammalian gene dupli- 
cation, has been described as the "desertifi- 
cation" of the vertebrate genome (71). Why 
are there clustered regions of high and low 
gene density, and are these accidents of his- 
tory or driven by selection and evolution? If 
these deserts are dispensable, it ought to be 
possible to find mammalian genomes that are
far smaller in size than the human genome. 
Indeed, many species of bats have genome 
sizes that are much smaller than that of humans;
for example, Miniopterus, a species of
Italian bat, has a genome size that is only 
50% that of humans (164). Similarly, Mun- 
tiacus, a species of Asian barking deer, has a 
genome size that is ~70% that of humans.
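
The isochore percentages quoted earlier in this section imply relative gene densities that are easy to compute: a fraction's share of the genes divided by its share of the genome, so that 1.0 means average density. A minimal Python sketch using only the percentages given in the text (function and variable names are hypothetical):

```python
# (share of genome, share of genes), as fractions quoted in the text.
isochores = {
    "H3 (most G+C-rich)": (0.09, 0.25),
    "L (low G+C)": (0.65, 0.48),
}


def relative_gene_density(genome_share, gene_share):
    """Gene density relative to the genome-wide average (1.0 = average)."""
    return gene_share / genome_share


for name, (genome_share, gene_share) in isochores.items():
    print(f"{name:20s} {relative_gene_density(genome_share, gene_share):.2f}")
```

On these figures the H3 isochores carry genes at nearly three times the average density, and the L isochores at roughly three-quarters of it.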

8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a 
nearly uniform ascertainment of polymorphism 
has been completed. Although we have identified
and mapped more than 3 million SNPs, this
by no means implies that the task of finding and 
cataloging SNPs is complete. These represent 
only a fraction of the SNPs present in the 
human population as a whole. Nevertheless, 
this first glimpse at genome-wide variation has 
revealed strong inhomogeneities in the distribution
of SNPs across the genome. Polymorphism
in DNA carries with it a snapshot of the past 
operation of population genetic forces, includ- 
ing mutation, migration, selection, and genetic 
drift. The availability of a dense array of SNPs
will allow questions related to each of these 
factors to be addressed on a genome-wide basis. 
SNP studies can establish the range of haplo-
types present in subjects of different ethnogeo-
graphic origins, providing insights into popula- 
tion history and migration patterns. Although 
such studies have suggested that modern human 
lineages derive from Africa, many important 
questions regarding human origins remain un- 
answered, and more analyses using detailed 
SNP maps will be needed to settle these con- 
troversies. In addition to providing evidence for 
population expansions, migration, and admix- 
ture, SNPs can serve as markers for the extent 
of evolutionary constraint acting on particular 
genes. The correlation between patterns of in- 
traspecies and interspecies genetic variation 
may prove to be especially informative to iden- 
tify sites of reduced genetic diversity that may 
mark loci where sequence variations are not 
tolerated.

The remarkable heterogeneity in SNP 
density implies that there are a variety of 
forces acting on polymorphism— sparse re- 
gions may have lower SNP density because 
the mutation rate is lower, because most of 
those regions have a lower fraction of muta- 
tions that are tolerated, or because recent 
strong selection in favor of a newly arisen 
allele "swept" the linked variation out of the 
population (165). The effect of random ge-
netic drift also varies widely across the ge- 
nome. The nonrecombining portion of the Y 
chromosome faces the strongest pressure 
from random drift because there are roughly 
one-quarter as many Y chromosomes in the 
population as there are autosomal chromo-
somes, and the level of polymorphism on the
Y is correspondingly less. Similarly, the X 
chromosome has a smaller effective popu- 
lation size than the autosomes, and its nu- 
cleotide diversity is also reduced. But even 
across a single autosome, the effective pop- 
ulation size can vary because the density of 
deleterious mutations may vary. Regions of 
high density of deleterious mutations will 
see a greater rate of elimination by selec- 
tion, and the effective population size will 
be smaller (166). As a result, the density of 
even completely neutral SNPs will be lower 
in such regions. There is a large literature 
on the association between SNP density 
and local recombination rates in Drosoph-
ila, and it remains an important task to
assess the strength of this association in the 
human genome, because of its impact on 
the design of local SNP densities for dis- 
ease-association studies. It also remains an 
important task to validate SNPs on a 
genomic scale in order to assess the degree 
of heterogeneity among geographic and 
ethnic populations. 
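
The chromosome-copy argument above can be made concrete with a small calculation. The sketch below assumes an idealized population with equal numbers of breeding males and females (an assumption not stated in the text) and counts chromosome copies per male-female pair; all names are illustrative only.

```python
from fractions import Fraction

# Copies carried by one female (XX) plus one male (XY), per chromosome type.
COPIES_PER_PAIR = {
    "autosome": 4,  # two copies in each sex
    "X": 3,         # two in the female, one in the male
    "Y": 1,         # one in the male only
}

def relative_effective_size(chromosome):
    """Copy number relative to an autosome, assuming an equal sex ratio."""
    return Fraction(COPIES_PER_PAIR[chromosome], COPIES_PER_PAIR["autosome"])

for name in ("autosome", "X", "Y"):
    print(name, relative_effective_size(name))
```

Under these assumptions the Y chromosome has one-quarter, and the X three-quarters, of the autosomal copy number, which is in line with the reduced polymorphism described for both.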

8.4 Genome complexity

We will soon be in a position to move away
from the cataloging of individual compo-
nents of the system, and beyond the sim-
plistic notions of "this binds to that, which
then docks on this, and then the complex
moves there" (167) to the exciting area
of network perturbations, nonlinear re-
sponses and thresholds, and their pivotal
role in human diseases.

The enumeration of other "parts lists" re-
veals that in organisms with complex nervous
systems, neither gene number, neuron number,
nor number of cell types correlates in any
meaningful manner with even simplistic mea-
sures of structural or behavioral complexity.
Nor would they be expected to; this is the realm
of nonlinearities and epigenesis (168). The 520
million neurons of the common octopus exceed
the neuronal number in the brain of a mouse by 
an order of magnitude. It is apparent from a 
comparison of genomic data on the mouse and 
human, and from comparative mammalian neu- 
roanatomy (169), that the morphological and 
behavioral diversity found in mammals is un- 
derpinned by a similar gene repertoire and sim- 
ilar neuroanatomies. For example, when one 
compares a pygmy marmoset (which is only 4
inches tall and weighs about 6 ounces) to a
chimpanzee, the brain volume of this minute
primate is found to be only about 1.5 cm³, two
orders of magnitude less than that of a chimp
and three orders less than that of humans. Yet
the neuroanatomies of all three brains are strik- 
ingly similar, and the behavioral characteristics 
of the pygmy marmoset are little different from 
those of chimpanzees. Between humans and 
chimpanzees, the gene number, gene structures 
and functions, chromosomal and genomic or- 
ganizations, and cell types and neuroanatomies 
are almost indistinguishable, yet the develop- 
mental modifications that predisposed human 
lineages to cortical expansion and development 
of the larynx, giving rise to language, culminat-
ed in a massive singularity that by even the 
simplest of criteria made humans more com- 
plex in a behavioral sense. 

Simple examination of the number of neu-
rons, cell types, or genes or of the genome 
size does not alone account for the differenc- 
es in complexity that we observe. Rather, it is 
the interactions within and among these sets 
that result in such great variation. In addition, 
it is possible that there are "special cases" of 
regulatory gene networks that have a dispro- 
portionate effect on the overall system. We 
have presented several examples of "regula- 
tory genes" that are significantly increased in 
the human genome compared with the fly and 
worm. These include extracellular ligands 
and their cognate receptors (e.g., wnt, friz-
zled, TGF-β, ephrins, and connexins), as well
as nuclear regulators (e.g., the KRAB and 
homeodomain transcription factor families), 
where a few proteins control broad develop- 
mental processes. The answers to these 
"complexities" perhaps lie in these expanded 
gene families and differences in the regulato- 
ry control of ancient genes, proteins, path- 
ways, and cells. 



8.5 Beyond single components

While few would disagree with the intuitive
conclusion that Einstein's brain was more
complex than that of Drosophila, closer com-
parisons such as whether the set of predicted
human proteins is more complex than the
protein set of Drosophila, and if so, to what
degree, are not straightforward, since protein,
protein domain, or protein-protein interaction
measures do not capture context-dependent
interactions that underpin the dynamics un-
derlying phenotype.

Currently, there are more than 30 different 
mathematical descriptions of complexity (170). 
However, we have yet to understand the math- 
ematical dependency relating the number of 
genes with organism complexity. One pragmat- 
ic approach to the analysis of biological sys- 
tems, which are composed of nonidentical ele- 
ments (proteins, protein complexes, interacting 
cell types, and interacting neuronal popula-
tions), is through graph theory (171). The ele-
ments of the system can be represented by the 
vertices of complex topographies, with the edg-
es representing the interactions between them.
Examination of large networks reveals that they 
can self-organize, but more important, they can 
be particularly robust. This robustness is not
due to redundancy, but is a property of inho-
mogeneously wired networks. The error toler-
ance of such networks comes with a price; they 
are vulnerable to the selection or removal of a 
few nodes that contribute disproportionately to 
network stability. Gene knockouts provide an 
illustration. Some knockouts may have minor 
effects, whereas others have catastrophic effects 
on the system. In the case of vimentin, a sup- 
posedly critical component of the cytoplasmic 
intermediate filament network of mammals, the 
knockout of the gene in mice reveals them to be 
reproductively normal, with no obvious pheno- 
typic effects (172), and yet the usually conspic-
uous vimentin network is completely absent.
On the other hand, ~30% of knockouts in
Drosophila and mice correspond to critical 
nodes whose reduction in gene product, or total 
elimination, causes the network to crash most 
of the time, although even in some of these 
cases, phenotypic normalcy ensues, given the 
appropriate genetic background. Thus, there are
no "good" genes or "bad" genes, but only net-
works that exist at various levels and at differ- 
ent connectivities, and at different states of 
sensitivity to perturbation. Sophisticated math- 
ematical analysis needs to be constantly evalu- 
ated against hard biological data sets that spe- 
cifically address network dynamics. Nowhere is 
this more critical than in attempts to come to 
grips with "complexity," particularly because 
deconvoluting and correcting complex net- 
works that have undergone perturbation, and 
have resulted in human diseases, is the greatest 
significant challenge now facing us. 
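
The contrast drawn above between tolerated and catastrophic node removals can be sketched on a toy graph. The example below is only an illustration under stated assumptions: the hub-and-leaf network, the function names, and the choice of which nodes to remove are all hypothetical and do not model any real gene or protein network.

```python
from collections import deque

def build_hub_network(n_hubs=3, leaves_per_hub=10):
    """Adjacency dict: hubs are linked to each other, and each hub has its own leaves."""
    adj = {f"hub{i}": set() for i in range(n_hubs)}
    hubs = list(adj)
    for i, hub in enumerate(hubs):
        for other in hubs[i + 1:]:
            adj[hub].add(other)
            adj[other].add(hub)
        for j in range(leaves_per_hub):
            leaf = f"leaf{i}_{j}"
            adj[leaf] = {hub}
            adj[hub].add(leaf)
    return adj

def largest_component(adj, removed=()):
    """Size of the largest connected component after deleting the `removed` nodes."""
    removed = set(removed)
    seen = set()
    best = 0
    for start in adj:
        if start in removed or start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in removed and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best

network = build_hub_network()
peripheral = ["leaf0_0", "leaf1_0", "leaf2_0"]
# The three most highly connected nodes (the hubs in this toy network).
critical = sorted(network, key=lambda n: len(network[n]), reverse=True)[:3]

intact_size = largest_component(network)
after_peripheral = largest_component(network, peripheral)
after_critical = largest_component(network, critical)
print(intact_size, after_peripheral, after_critical)
```

In this toy case, deleting three peripheral nodes leaves the rest of the network connected, whereas deleting the same number of highly connected nodes fragments it completely.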

It has been predicted for the last 15 years
that complete sequencing of the human ge-
nome would open up new strategies for hu-
man biological research and would have a
major impact on medicine, and through med-
icine and public health, on society. Effects on 
biomedical research are already being felt. 
This assembly of the human genome se- 
quence is but a first, hesitant step on a long 
and exciting journey toward understanding 
the role of the genome in human biology. It 
has been possible only because of innova- 
tions in instrumentation and software that 
have allowed automation of almost every step 
of the process from DNA preparation to an- 
notation. The next steps are clear: We must 
define the complexity that ensues when this
relatively modest set of about 30,000 genes is 
expressed. The sequence provides the frame- 
work upon which all the genetics, biochem-
istry, physiology, and ultimately phenotype
depend. It provides the boundaries for scien- 
tific inquiry. The sequence is only the first 
level of understanding of the genome. All 
genes and their control elements must be 
identified; their functions, in concert as well 
as in isolation, defined; their sequence varia- 
tion worldwide described; and the relation 
between genome variation and specific phe- 
notypic characteristics determined. Now we 
know what we have to explain. 

Another paramount challenge awaits: 
public discussion of this information and its 
potential for improvement of personal health. 
Many diverse sources of data have shown 
that any two individuals are more than 99.9% 
identical in sequence, which means that all 
the glorious differences among individuals in 
our species that can be attributed to genes 
falls in a mere 0.1% of the sequence. There 
are two fallacies to be avoided: determinism, 
the idea that all characteristics of the person 
are "hard-wired" by the genome; and reduc-
tionism, the view that with complete knowl-
edge of the human genome sequence, it is 
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence.

A historic moment for the scientific endeavor

Humanity has been given a great gift. With the completion of the human
genome sequence, we have received a powerful tool for unlocking the 
secrets of our genetic heritage and for finding our place among the other 
participants in the adventure of life. 

This week's issue of Science contains the report of the sequencing of 
the human genome from a group of authors led by Craig Venter of Celera
Genomics. The report of the sequencing of the human genome from the
publicly funded consortium of laboratories led by Francis Collins appears
in this week's Nature. This stunning achievement has been portrayed — 
often unfairly — as a competition between two 
ventures, one public and one private. That characterization detracts from 
the awesome accomplishment jointly unveiled this week. In truth, each 
project contributed to the other. The inspired vision that launched the 
publicly funded project roughly 10 years ago reflected, and now rewards, 
the confidence of those who believe that the pursuit of large-scale funda-
mental problems in the life sciences is in the national interest. The technical
innovation and drive of Craig Venter and his colleagues made it possible 
to celebrate this accomplishment far sooner than was believed possible. 
Thus, we can salute what has become, in the end, not a contest but a 
marriage (perhaps encouraged by shotgun) between public funding and 
private entrepreneurship. 

There are excellent scientific reasons for applauding an outcome that 
has given us two winners. Two sequences are better than one; the opportunity for comparison and con-
vergence is invaluable. Indeed, a real-world proof of the importance of access to both sets of data can
be found in the pages of this issue of Science, in the comparative analysis by Olivier et al. (p. 1298).

Although we have made the point before, it is worth repeating that the sequencing of the human 
genome represents, not an ending, but the beginning of a new approach to biology. As Galas says in 
his Viewpoint (p. 1257), the knowledge that all of the genetic components of any process can be 
identified will give extraordinary new power to scientists. Because of this breakthrough, research 
can evolve from analyzing the effects of individual genes to a more integrated view that examines 
whole ensembles of genes as they interact to form a living human being. Several articles in this issue 
highlight how this approach is already beginning to revolutionize the way we look at human disease. 

This has been a massive project, on a scale unparalleled in the history of biology, but of course 
it has built on the scientific insights of centuries of investigators. By coincidence, this landmark 
announcement falls during the week of the anniversary of the birth of Charles Darwin. Darwin's
message that the survival of a species can depend on its ability to evolve in the face of change is 
peculiarly pertinent to discussions that have gone on in the past year over access to the Celera data. 
(Full information regarding the agreements that were reached to make the data available can be
found at www.sciencemag.org/feature/data/announcement/gsp.shl.) We are willing to be flexible, in
allowing data repositories other than the traditional GenBank, while insisting on access to all the 
data needed to verify conclusions. In this domain, change is everywhere: Commercial researchers 
are producing more and more potentially valuable sequences, yet (at least in the United States) 
laws governing databases provide scant protection against piracy. Had the Celera data been kept se- 
cret, it would have been a serious loss to the scientific community. We hope that our adaptability in 
the face of change will enable other proprietary data to be published after peer review, in a way that 
satisfies our continuing commitment to full access.

It should be no surprise that an achievement so stunning, and so carefully watched, has created
new challenges for the scientific venture. Science is proud to have played a role in bringing this 
discovery onto the public stage. It is literally true that this is a historic moment for the scientific en- 
deavor. The human genome has been called the Book of Life. Rather, it is a library, in which, with 
rules that encourage exploration and reward creativity, we can find many of the books that will 
help define us and our place in the great tapestry of life.

Barbara R. Jasny and Donald Kennedy

>NM_139238 ACCESSION: NM_139238 NID: gi 21327692 ref NM_139238.1
Homo sapiens ADAMTS-like 1 (ADAMTSL1), transcript
variant 1, mRNA
Length = 2317 

Identities = 668/669 (99%), Positives = 669/669 (100%) 
Frame = +2 

Query: 1    MECCRRATPGTLLLFLAFLLLSSRTARSEEDRDGLWDAWGPWSECSRTCGGGASYSLRRC 60
            MECCRRATPGTLLLFLAFLLLSSRTARSEEDRDGLWDAWGPWSECSRTCGGGASYSLRRC
Sbjct: 65   MECCRRATPGTLLLFLAFLLLSSRTARSEEDRDGLWDAWGPWSECSRTCGGGASYSLRRC 244

Query: 61   LSSKSCEGRNIRYRTCSNVDCPPEAGDFRAQQCSAHNDVKHHGQFYEWLPVSNDPDNPCS 120
            LSSKSCEGRNIRYRTCSNVDCPPEAGDFRAQQCSAHNDVKHHGQFYEWLPVSNDPDNPCS
Sbjct: 245  LSSKSCEGRNIRYRTCSNVDCPPEAGDFRAQQCSAHNDVKHHGQFYEWLPVSNDPDNPCS 424

Query: 121  LKCQAKGTTLVVELAPKVLDGTRCYTESLDMCISGLCQIVGCDHQLGSTVKEDNCGVCNG 180
            LKCQAKGTTLVVELAPKVLDGTRCYTESLDMCISGLCQIVGCDHQLGSTVKEDNCGVCNG
Sbjct: 425  LKCQAKGTTLVVELAPKVLDGTRCYTESLDMCISGLCQIVGCDHQLGSTVKEDNCGVCNG 604

Query: 181  DGSTCRLVRGQYKSQLSATKSDDTVVAIPYGSRHIRLVLKGPDHLYLETKTLQGTKGENS 240
            DGSTCRLVRGQYKSQLSATKSDDTVVAIPYGSRHIRLVLKGPDHLYLETKTLQGTKGENS
Sbjct: 605  DGSTCRLVRGQYKSQLSATKSDDTVVAIPYGSRHIRLVLKGPDHLYLETKTLQGTKGENS 784

Query: 241  LSSTGTFLVDNSSVDFQKFPDKEILRMAGPLTADFIVKIRNSGSADSTVQFIFYQPIIHR 300
            L+STGTFLVDNSSVDFQKFPDKEILRMAGPLTADFIVKIRNSGSADSTVQFIFYQPIIHR
Sbjct: 785  LNSTGTFLVDNSSVDFQKFPDKEILRMAGPLTADFIVKIRNSGSADSTVQFIFYQPIIHR 964

Query: 301  WRETDFFPCSATCGGGYQLTSAECYDLRSNRVVADQYCHYYPENIKPKPKLQECNLDPCP 360
            WRETDFFPCSATCGGGYQLTSAECYDLRSNRVVADQYCHYYPENIKPKPKLQECNLDPCP
Sbjct: 965  WRETDFFPCSATCGGGYQLTSAECYDLRSNRVVADQYCHYYPENIKPKPKLQECNLDPCP 1144

Query: 361  ASDGYKQIMPYDLYHPLPRWEATPWTACSSSCGGGIQSRAVSCVEEDIQGHVTSVEEWKC 420
            ASDGYKQIMPYDLYHPLPRWEATPWTACSSSCGGGIQSRAVSCVEEDIQGHVTSVEEWKC
Sbjct: 1145 ASDGYKQIMPYDLYHPLPRWEATPWTACSSSCGGGIQSRAVSCVEEDIQGHVTSVEEWKC 1324

Query: 421  MYTPKMPIAQPCNIFDCPKWLAQEWSPCTVTCGQGLRYRVVLCIDHRGMHTGGCSPKTKP 480
            MYTPKMPIAQPCNIFDCPKWLAQEWSPCTVTCGQGLRYRVVLCIDHRGMHTGGCSPKTKP
Sbjct: 1325 MYTPKMPIAQPCNIFDCPKWLAQEWSPCTVTCGQGLRYRVVLCIDHRGMHTGGCSPKTKP 1504

Query: 481  HIKEECIVPTPCYKPKEKLPVEAKLPWFKQAQELEEGAAVSEEPSFIPEAWSACTVTCGV 540
            HIKEECIVPTPCYKPKEKLPVEAKLPWFKQAQELEEGAAVSEEPSFIPEAWSACTVTCGV
Sbjct: 1505 HIKEECIVPTPCYKPKEKLPVEAKLPWFKQAQELEEGAAVSEEPSFIPEAWSACTVTCGV 1684

Query: 541  GTQVRIVRCQVLLSFSQSVADLPIDECEGPKPASQRACYAGPCSGEIPEFNPDETDGLFG 600
            GTQVRIVRCQVLLSFSQSVADLPIDECEGPKPASQRACYAGPCSGEIPEFNPDETDGLFG
Sbjct: 1685 GTQVRIVRCQVLLSFSQSVADLPIDECEGPKPASQRACYAGPCSGEIPEFNPDETDGLFG 1864

Query: 601  GLQDFDELYDWEYEGFTKCSESCGGGVQEAVVSCLNKQTREPAEENLCVTSRRPPQLLKS 660
            GLQDFDELYDWEYEGFTKCSESCGGGVQEAVVSCLNKQTREPAEENLCVTSRRPPQLLKS
Sbjct: 1865 GLQDFDELYDWEYEGFTKCSESCGGGVQEAVVSCLNKQTREPAEENLCVTSRRPPQLLKS 2044

Query: 661  CNLDPCPAR 669
            CNLDPCPAR
Sbjct: 2045 CNLDPCPAR 2069
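
The relationship between the Query (protein) positions and the Sbjct (mRNA) coordinates in an ungapped translated alignment such as the one above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the default of 65 is simply the first Sbjct coordinate shown in the alignment.

```python
def subject_coordinates(query_start, query_end, first_codon_base=65):
    """Map a protein (query) residue range to mRNA (subject) base coordinates.

    Assumes an ungapped alignment in which query residue 1 is encoded by the
    codon that starts at `first_codon_base`.
    """
    start = first_codon_base + 3 * (query_start - 1)
    end = first_codon_base + 3 * query_end - 1
    return start, end

def reading_frame(first_codon_base):
    """Forward reading frame (1, 2, or 3) of a codon starting at this base."""
    return (first_codon_base - 1) % 3 + 1

print(subject_coordinates(1, 60))    # (65, 244)
print(subject_coordinates(61, 120))  # (245, 424)
print(reading_frame(65))             # 2
```

Each residue spans three bases, so a full 60-residue row covers 180 bases, and a first codon at base 65 corresponds to the reported frame of +2.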



>NM_052866 ACCESSION: NM_052866 NID: gi 21327690 ref NM_052866.2
Homo sapiens ADAMTS-like 1 (ADAMTSL1), transcript
variant 2, mRNA
Length = 1810 

Identities = 524/525 (99%), Positives = 525/525 (100%) 
Frame = +2 

Query: 1    MECCRRATPGTLLLFLAFLLLSSRTARSEEDRDGLWDAWGPWSECSRTCGGGASYSLRRC 60
            MECCRRATPGTLLLFLAFLLLSSRTARSEEDRDGLWDAWGPWSECSRTCGGGASYSLRRC
Sbjct: 65   MECCRRATPGTLLLFLAFLLLSSRTARSEEDRDGLWDAWGPWSECSRTCGGGASYSLRRC 244

Query: 61   LSSKSCEGRNIRYRTCSNVDCPPEAGDFRAQQCSAHNDVKHHGQFYEWLPVSNDPDNPCS 120
            LSSKSCEGRNIRYRTCSNVDCPPEAGDFRAQQCSAHNDVKHHGQFYEWLPVSNDPDNPCS
Sbjct: 245  LSSKSCEGRNIRYRTCSNVDCPPEAGDFRAQQCSAHNDVKHHGQFYEWLPVSNDPDNPCS 424

Query: 121  LKCQAKGTTLVVELAPKVLDGTRCYTESLDMCISGLCQIVGCDHQLGSTVKEDNCGVCNG 180
            LKCQAKGTTLVVELAPKVLDGTRCYTESLDMCISGLCQIVGCDHQLGSTVKEDNCGVCNG
Sbjct: 425  LKCQAKGTTLVVELAPKVLDGTRCYTESLDMCISGLCQIVGCDHQLGSTVKEDNCGVCNG 604

Query: 181  DGSTCRLVRGQYKSQLSATKSDDTVVAIPYGSRHIRLVLKGPDHLYLETKTLQGTKGENS 240
            DGSTCRLVRGQYKSQLSATKSDDTVVAIPYGSRHIRLVLKGPDHLYLETKTLQGTKGENS
Sbjct: 605  DGSTCRLVRGQYKSQLSATKSDDTVVAIPYGSRHIRLVLKGPDHLYLETKTLQGTKGENS 784

Query: 241  LSSTGTFLVDNSSVDFQKFPDKEILRMAGPLTADFIVKIRNSGSADSTVQFIFYQPIIHR 300
            L+STGTFLVDNSSVDFQKFPDKEILRMAGPLTADFIVKIRNSGSADSTVQFIFYQPIIHR
Sbjct: 785  LNSTGTFLVDNSSVDFQKFPDKEILRMAGPLTADFIVKIRNSGSADSTVQFIFYQPIIHR 964

Query: 301  WRETDFFPCSATCGGGYQLTSAECYDLRSNRVVADQYCHYYPENIKPKPKLQECNLDPCP 360
            WRETDFFPCSATCGGGYQLTSAECYDLRSNRVVADQYCHYYPENIKPKPKLQECNLDPCP
Sbjct: 965  WRETDFFPCSATCGGGYQLTSAECYDLRSNRVVADQYCHYYPENIKPKPKLQECNLDPCP 1144

Query: 361  ASDGYKQIMPYDLYHPLPRWEATPWTACSSSCGGGIQSRAVSCVEEDIQGHVTSVEEWKC 420
            ASDGYKQIMPYDLYHPLPRWEATPWTACSSSCGGGIQSRAVSCVEEDIQGHVTSVEEWKC
Sbjct: 1145 ASDGYKQIMPYDLYHPLPRWEATPWTACSSSCGGGIQSRAVSCVEEDIQGHVTSVEEWKC 1324

Query: 421  MYTPKMPIAQPCNIFDCPKWLAQEWSPCTVTCGQGLRYRVVLCIDHRGMHTGGCSPKTKP 480
            MYTPKMPIAQPCNIFDCPKWLAQEWSPCTVTCGQGLRYRVVLCIDHRGMHTGGCSPKTKP
Sbjct: 1325 MYTPKMPIAQPCNIFDCPKWLAQEWSPCTVTCGQGLRYRVVLCIDHRGMHTGGCSPKTKP 1504

Query: 481  HIKEECIVPTPCYKPKEKLPVEAKLPWFKQAQELEEGAAVSEEPS 525
            HIKEECIVPTPCYKPKEKLPVEAKLPWFKQAQELEEGAAVSEEPS
Sbjct: 1505 HIKEECIVPTPCYKPKEKLPVEAKLPWFKQAQELEEGAAVSEEPS 1639

NM_139238 2317 bp mRNA linear PRI 07-MAY-2003

Homo sapiens ADAMTS-like 1 (ADAMTSL1), transcript variant 1, mRNA.
NM_139238

NM_139238.1 GI:21327692

Homo sapiens (human) 
Homo sapiens 

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi ; 
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
1 (bases 1 to 2317) 

Hirohata,S., Wang,L.W., Miyagi,M., Yan,L., Seldin,M.F., Keene,D.R., 
Crabb,J.W. and Apte, S.S. 

Punctin, a novel ADAMTS-like molecule, ADAMTSL-1, in extracellular 
matrix 

J. Biol. Chem. 277 (14), 12182-12189 (2002) 

21922817 

11805097 

GeneRIF: Punctin, a novel ADAMTS-like molecule, ADAMTSL-1, in
extracellular matrix 

REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AF251058.1 and BC030262.1. 

Summary: This gene encodes a secreted protein resembling members of
the ADAMTS (a disintegrin and metalloproteinase with thrombospondin 
motif) family. This protein lacks the propeptide region and the 
metalloproteinase and disintegrin-like domains, which are typical 
of the ADAMTS family, but contains other ADAMTS domains, including 
the thrombospondin type 1 motif. This protein may have important 
functions in the extracellular matrix. Alternative splicing of this 
gene results in 3 transcript variants encoding different isoforms. 

Transcript Variant: This variant (1) encodes the longest isoform
(1).

Location/Qualifiers 
1..2317
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="9"
/map="9p22.1"
1..2317
/gene="ADAMTSL1"
/note="synonyms: ADAMTSR1, MGC40193"
/db_xref="LocusID:92949"
65..2116
/gene="ADAMTSL1"
/note="ADAM-TS related protein 1; thrombospondin; punctin"
/codon_start=1
/product="ADAM-TS related protein 1 isoform 1"
/protein_id="NP_640329.1"
/db_xref="GI:21327693"
/db_xref="LocusID:92949"
/translation="MECCRRATPGTLLLFLAFLLLSSRTARSEEDRDGLWDAWGPWSE
CSRTCGGGASYSLRRCLSSKSCEGRNIRYRTCSNVDCPPEAGDFRAQQCSAHNDVKHH
GQFYEWLPVSNDPDNPCSLKCQAKGTTLVVELAPKVLDGTRCYTESLDMCISGLCQIV
GCDHQLGSTVKEDNCGVCNGDGSTCRLVRGQYKSQLSATKSDDTVVAIPYGSRHIRLV
LKGPDHLYLETKTLQGTKGENSLNSTGTFLVDNSSVDFQKFPDKEILRMAGPLTADFI
VKIRNSGSADSTVQFIFYQPIIHRWRETDFFPCSATCGGGYQLTSAECYDLRSNRVVA
DQYCHYYPENIKPKPKLQECNLDPCPASDGYKQIMPYDLYHPLPRWEATPWTACSSSC
GGGIQSRAVSCVEEDIQGHVTSVEEWKCMYTPKMPIAQPCNIFDCPKWLAQEWSPCTV
TCGQGLRYRVVLCIDHRGMHTGGCSPKTKPHIKEECIVPTPCYKPKEKLPVEAKLPWF
KQAQELEEGAAVSEEPSFIPEAWSACTVTCGVGTQVRIVRCQVLLSFSQSVADLPIDE
CEGPKPASQRACYAGPCSGEIPEFNPDETDGLFGGLQDFDELYDWEYEGFTKCSESCG
GGVQEAVVSCLNKQTREPAEENLCVTSRRPPQLLKSCNLDPCPARSSIDSAWNACNVL
C"

misc_feature 170..310
/gene="ADAMTSL1"
/note="TSP1; Region: Thrombospondin type 1 repeats"
/db_xref="CDD:smart00209"
variation complement(217)
/allele="T"
/allele="A"
/db_xref="dbSNP:2277160"
BASE COUNT 554 a 619 c 619 g 525 t 

ORIGIN 

1 gcaggcagag gagcacttag cagcttattc agtgtccgat tctgattccg gcaaggatcc 
61 aagcatggaa tgctgccgtc gggcaactcc tggcacactg ctcctctttc tggctttcct 
121 gctcctgagt tccaggaccg cacgctccga ggaggaccgg gacggcctat gggatgcctg 
181 gggcccatgg agtgaatgct cacgcacctg cgggggtggg gcctcctact ctctgaggcg 
241 ctgcctgagc agcaagagct gtgaaggaag aaatatccga tacagaacat gcagtaatgt 
301 ggactgccca ccagaagcag gtgatttccg agctcagcaa tgctcagctc ataatgatgt 
361 caagcaccat ggccagtttt atgaatggct tcctgtgtct aatgaccctg acaacccatg 
421 ttcactcaag tgccaagcca aaggaacaac cctggttgtt gaactagcac ctaaggtctt 
481 agatggtacg cgttgctata cagaatcttt ggatatgtgc atcagtggtt tatgccaaat 
541 tgttggctgc gatcaccagc tgggaagcac cgtcaaggaa gataactgtg gggtctgcaa 
601 cggagatggg tccacctgcc ggctggtccg agggcagtat aaatcccagc tctccgcaac 
661 caaatcggat gatactgtgg ttgcaattcc ctatggaagt agacatattc gccttgtctt 
721 aaaaggtcct gatcacttat atctggaaac caaaaccctc caggggacta aaggtgaaaa 
781 cagtctcaac tccacaggaa ctttccttgt ggacaattct agtgtggact tccagaaatt 
841 tccagacaaa gagatactga gaatggctgg accactcaca gcagatttca ttgtcaagat 
901 tcgtaactcg ggctccgctg acagtacagt ccagttcatc ttctatcaac ccatcatcca 
961 ccgatggagg gagacggatt tctttccttg ctcagcaacc tgtggaggag gttatcagct 
1021 gacatcggct gagtgctacg atctgaggag caaccgtgtg gttgctgacc aatactgtca 
1081 ctattaccca gagaacatca aacccaaacc caagcttcag gagtgcaact tggatccttg 
1141 tccagccagt gacggataca agcagatcat gccttatgac ctctaccatc cccttcctcg 
1201 gtgggaggcc accccatgga ccgcgtgctc ctcctcgtgt ggggggggca tccagagccg 
1261 ggcagtttcc tgtgtggagg aggacatcca ggggcatgtc acttcagtgg aagagtggaa
1321 atgcatgtac acccctaaga tgcccatcgc gcagccctgc aacatttttg actgccctaa 
1381 atggctggca caggagtggt ctccgtgcac agtgacatgt ggccagggcc tcagataccg 
1441 tgtggtcctc tgcatcgacc atcgaggaat gcacacagga ggctgtagcc caaaaacaaa 
1501 gccccacata aaagaggaat gcatcgtacc cactccctgc tataaaccca aagagaaact 
1561 tccagtcgag gccaagttgc catggttcaa acaagctcaa gagctagaag aaggagctgc 
1621 tgtgtcagag gagccctcgt tcatcccaga ggcctggtcg gcctgcacag tcacctgtgg 
1681 tgtggggacc caggtgcgaa tagtcaggtg ccaggtgctc ctgtctttct ctcagtccgt 
1741 ggctgacctg cctattgacg agtgtgaagg gcccaagcca gcatcccagc gtgcctgtta 
1801 tgcaggccca tgcagcgggg aaattcctga gttcaaccca gacgagacag atgggctctt 
1861 tggtggcctg caggatttcg acgagctgta tgactgggag tatgaggggt tcaccaagtg
1921 ctccgagtcc tgtggaggag gtgtccagga ggctgtggtg agctgcttga acaaacagac
1981 tcgggagcct gctgaggaga acctgtgcgt gaccagccgc cggcccccac agctcctgaa
2041 gtcctgcaat ttggatccct gcccagcaag aagcagtatc gactcagcat ggaacgcctg
2101 caacgttctt tgttaggcaa ccaagaggcc tggcttctca tcctgctgtc accaactagc
2161 tctgtggcct agggcgaggt gtctgccctt tatgtttcca catctgcaaa gtgaactggt
2221 tgtacctgat gatctgagat cccatgactt gctcacatgt cccatgattc tttattttgt
2281 aggcagaagc attaaacagc tactcctgct gctgtgt

NM_052866 1810 bp mRNA linear PRI 07-MAY-2003

Homo sapiens ADAMTS-like 1 (ADAMTSL1), transcript variant 2, mRNA.
NM_052866 

NM_052866.2 GI:21327690 

Homo sapiens (human) 
Homo sapiens 

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
1 (bases 1 to 1810) 

Hirohata,S., Wang,L.W., Miyagi,M., Yan,L., Seldin,M.F., Keene,D.R., 
Crabb,J.W. and Apte,S.S. 

Punctin, a novel ADAMTS-like molecule, ADAMTSL-1, in extracellular
matrix 

J. Biol. Chem. 277 (14), 12182-12189 (2002)

21922817 

11805097 

GeneRIF: Punctin, a novel ADAMTS-like molecule, ADAMTSL-1, in
extracellular matrix 

REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AF176313.1 and BC030262.1.
On Jun 6, 2002 this sequence version replaced gi:164183_68.

Summary: This gene encodes a secreted protein resembling members of 
the ADAMTS (a disintegrin and metalloproteinase with thrombospondin 
motif) family. This protein lacks the propeptide region and the 
metalloproteinase and disintegrin-like domains, which are typical 
of the ADAMTS family, but contains other ADAMTS domains, including 
the thrombospondin type 1 motif. This protein may have important 
functions in the extracellular matrix. Alternative splicing of this 
gene results in 3 transcript variants encoding different isoforms. 

Transcript Variant: This variant (2) has alternate 3' exons, as 
compared to variant 1, resulting in immediate translation 
termination. Isoform 2 is truncated at the C-terminus, compared to
isoform 1. 

COMPLETENESS: complete on the 3' end. 
Location/Qualifiers 
1..1810
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="9"
/map="9p22.1"
1..1810
/gene="ADAMTSL1"
/note="synonyms: ADAMTSR1, MGC40193"
/db_xref="LocusID:92949"
65..1642
/gene="ADAMTSL1"
/note="ADAM-TS related protein 1; thrombospondin; punctin"
/codon_start=1
/product="ADAM-TS related protein 1 isoform 2"
/protein_id="NP_443098.2"
/db_xref="GI:21327691"
/db_xref="LocusID:92949"
/translation="MECCRRATPGTLLLFLAFLLLSSRTARSEEDRDGLWDAWGPWSE
CSRTCGGGASYSLRRCLSSKSCEGRNIRYRTCSNVDCPPEAGDFRAQQCSAHNDVKHH
GQFYEWLPVSNDPDNPCSLKCQAKGTTLVVELAPKVLDGTRCYTESLDMCISGLCQIV
GCDHQLGSTVKEDNCGVCNGDGSTCRLVRGQYKSQLSATKSDDTVVAIPYGSRHIRLV
LKGPDHLYLETKTLQGTKGENSLNSTGTFLVDNSSVDFQKFPDKEILRMAGPLTADFI
VKIRNSGSADSTVQFIFYQPIIHRWRETDFFPCSATCGGGYQLTSAECYDLRSNRVVA
DQYCHYYPENIKPKPKLQECNLDPCPASDGYKQIMPYDLYHPLPRWEATPWTACSSSC
GGGIQSRAVSCVEEDIQGHVTSVEEWKCMYTPKMPIAQPCNIFDCPKWLAQEWSPCTV
TCGQGLRYRVVLCIDHRGMHTGGCSPKTKPHIKEECIVPTPCYKPKEKLPVEAKLPWF
KQAQELEEGAAVSEEPS"
170..310
/gene="ADAMTSL1"
/note="TSP1; Region: Thrombospondin type 1 repeats"
/db_xref="CDD:smart00209"
complement(217)
/allele="T"
/allele="A"
/db_xref="dbSNP:2277160"
1678..1679
/gene="ADAMTSL1"
/allele="GT"
/allele="-"
/db_xref="dbSNP:3833713"
1771..1776
/gene="ADAMTSL1"
1795
/gene="ADAMTSL1"

BASE COUNT 481 a 459 c 453 g 417 t

ORIGIN

1 gcaggcagag gagcacttag cagcttattc agtgtccgat tctgattccg gcaaggatcc
61 aagcatggaa tgctgccgtc gggcaactcc tggcacactg ctcctctttc tggctttcct
121 gctcctgagt tccaggaccg cacgctccga ggaggaccgg gacggcctat gggatgcctg
181 gggcccatgg agtgaatgct cacgcacctg cgggggtggg gcctcctact ctctgaggcg
241 ctgcctgagc agcaagagct gtgaaggaag aaatatccga tacagaacat gcagtaatgt
301 ggactgccca ccagaagcag gtgatttccg agctcagcaa tgctcagctc ataatgatgt
361 caagcaccat ggccagtttt atgaatggct tcctgtgtct aatgaccctg acaacccatg
421 ttcactcaag tgccaagcca aaggaacaac cctggttgtt gaactagcac ctaaggtctt
481 agatggtacg cgttgctata cagaatcttt ggatatgtgc atcagtggtt tatgccaaat
541 tgttggctgc gatcaccagc tgggaagcac cgtcaaggaa gataactgtg gggtctgcaa
601 cggagatggg tccacctgcc ggctggtccg agggcagtat aaatcccagc tctccgcaac
661 caaatcggat gatactgtgg ttgcaattcc ctatggaagt agacatattc gccttgtctt
721 aaaaggtcct gatcacttat atctggaaac caaaaccctc caggggacta aaggtgaaaa
781 cagtctcaac tccacaggaa ctttccttgt ggacaattct agtgtggact tccagaaatt
841 tccagacaaa gagatactga gaatggctgg accactcaca gcagatttca ttgtcaagat
901 tcgtaactcg ggctccgctg acagtacagt ccagttcatc ttctatcaac ccatcatcca
961 ccgatggagg gagacggatt tctttccttg ctcagcaacc tgtggaggag gttatcagct
1021 gacatcggct gagtgctacg atctgaggag caaccgtgtg gttgctgacc aatactgtca
1081 ctattaccca gagaacatca aacccaaacc caagcttcag gagtgcaact tggatccttg
1141 tccagccagt gacggataca agcagatcat gccttatgac ctctaccatc cccttcctcg
1201 gtgggaggcc accccatgga ccgcgtgctc ctcctcgtgt ggggggggca tccagagccg
1261 ggcagtttcc tgtgtggagg aggacatcca ggggcatgtc acttcagtgg aagagtggaa
1321 atgcatgtac acccctaaga tgcccatcgc gcagccctgc aacatttttg actgccctaa
1381 atggctggca caggagtggt ctccgtgcac agtgacatgt ggccagggcc tcagataccg
1441 tgtggtcctc tgcatcgacc atcgaggaat gcacacagga ggctgtagcc caaaaacaaa
1501 gccccacata aaagaggaat gcatcgtacc cactccctgc tataaaccca aagagaaact
1561 tccagtcgag gccaagttgc catggttcaa acaagctcaa gagctagaag aaggagctgc
1621 tgtgtcagag gagccctcgt aagttgtaaa agcacagact gttctatatt tgaaactgtt
1681 ttgtttaaag aaagcagtgt ctcactggtt gtagctttca tgggttctga actaagtgta
1741 atcatctcac caaagctttt tggctctcaa attaaagatt gattagtttc aaaaaaaaaa
1801 aaaaaaaaaa

Punctin (ADAMTSL-1) is a secreted molecule resem-
bling members of the ADAMTS family of proteases.
Punctin lacks the pro-metalloprotease and the disintegrin-like
domain typical of this family but contains
other ADAMTS domains in precise order including four
thrombospondin type I repeats. Punctin is the product
of a distinct gene on human chromosome 9p21-22 and
mouse chromosome 4 that is expressed in adult skeletal
muscle. His-tagged punctin expressed in stably transfected
High-Five™ insect cells was purified to apparent
homogeneity by Ni-chromatography of conditioned medium.
The NH2 terminus is not blocked and has the
sequence EEDRD and so forth as determined by Edman
degradation, demonstrating signal peptidase processing.
Recombinant epitope-tagged punctin has a calculated
mass of 59,991 Da but exhibits major molecular
species of 61,970 ± 6 Da and 62,131 ± 5 Da as measured by
liquid chromatography electrospray mass spectrometry.
Punctin is a glycoprotein based on carbohydrate
staining and liquid chromatography electrospray mass
spectrometry glycopeptide analysis. Glycosylation occurs
at a single N-linked site as demonstrated by altered
electrophoretic migration of punctin expressed in the
presence of tunicamycin A. Punctin contains disulfide
bonds based on antibody accessibility and electrophoretic
migration under reducing versus nonreducing
conditions. Rotary shadowing demonstrates that punctin
is hatchet-shaped, having a globular region attached
to a short stem. In transfected COS-1 cells, punctin is
deposited in the cell substratum in a punctate fashion
and is excluded from focal contacts. Punctin is the first
member of a novel family of ADAMTS-like proteins that
may have important functions in the extracellular
matrix.



Metalloproteases responsible for extracellular matrix (ECM)¹ turnover
have a modular structure. Matrix metalloproteinases
(MMPs) (1), a disintegrin-like and metalloprotease (ADAMs) 
(2), and proteases of the ADAMTS family (3, 4) are composed of 
characteristic domains arranged in a precise order that is the 
hallmark of each family. These enzymes are structurally and 
functionally bipartite consisting of an enzymatic domain at- 
tached to nonenzymatic or ancillary domains. The ancillary 
domains localize these proteases to substrates, the cell surface, 
or to the ECM. The ancillary domains of the gelatinases 
MMP-2 and MMP-9 are among the best studied of the sub- 
strate-binding domains. The fibronectin type II domains of the 
gelatinases are involved in binding to gelatin and some colla- 
gens as well as to fibronectin and heparin as in the case of 
MMP-2 (5, 6). The gelatin-binding domain of MMP-2 binds the 
matricellular proteins thrombospondin-1 (TSP1) and TSP2 (7).
Although neither is a substrate for MMP-2, the interaction may 
mediate the clearance of MMP-2 and affect cell-adhesive prop- 
erties (8). The MMP-2 hemopexin domain interacts with the 
carboxyl terminus of the tissue inhibitor of metalloproteases-2, 
facilitating pro-MMP-2 activation by membrane-type MMPs (1, 
5, 6, 9). The MMP-2 hemopexin domain also interacts with a 
chemokine called monocyte chemoattractant protein-3, which 
allows its processing by the catalytic domain (10). The disinte- 
grin domains of ADAMs such as ADAM-15 are implicated in 
cell-cell adhesion (2, 11, 12), and the ancillary domains of 
ADAMTS-1 are required for its binding to the ECM (13). In 
some ADAMs, the zinc-binding active site is nonfunctional, 
suggesting that they do not function as proteases at all but may 
instead have a primary role in adhesion via their ancillary 
domains (2). 

With this background, it is conceptually possible that gene 
products containing only the ancillary domains of ADAMTS 
may have specific functions in cell-cell or cell-matrix interac- 
tions or may regulate ADAMTS proteases. We have identified 
an ADAMTS-like (ADAMTSL) molecule named punctin,²
which is the product of a gene distinct from any in the ADAMTS
family and is composed of ADAMTS ancillary domains
alone. We have purified and characterized recombinant punctin
produced in insect cells, visualized it by electron microscopy,
and demonstrated that it is a glycoprotein and a component
of the ECM.

¹ The abbreviations used are: ECM, extracellular matrix; ADAMTSL,
a disintegrin-like and metalloprotease domain with thrombospondin
type I motifs-like; ADAMTS, a disintegrin-like and metalloprotease
domain with thrombospondin type I motifs; ADAM, a disintegrin-like
and metalloprotease; MS, mass spectrometry; EST, expressed sequence
tag; LC-ESMS, liquid chromatography-electrospray mass spectrometry;
MALDI-TOF, matrix-assisted laser desorption ionization time-of-flight;
MMP, matrix metalloprotease; ORF, open reading frame; PBS,
phosphate-buffered saline; RACE, rapid amplification of cDNA ends; TSP,
thrombospondin; TS, thrombospondin type I domain; HexNAc,
N-acetylhexosamine; NeuAc, N-acetylneuraminic acid.

² Approved gene symbols ADAMTSL1 and Adamtsl1 indicate human
and mouse orthologs, respectively. The corresponding protein
product of these genes, ADAMTSL-1, is designated by the trivial name
punctin because of its punctate distribution beneath transfected cells.

This paper is available on line at http://www.jbc.org

EXPERIMENTAL PROCEDURES 

cDNA Cloning and Sequence Analysis — Using BLAST programs 
from the National Center for Biotechnology Information, we scanned 
the data base of ESTs using the protein sequences of ADAMTS pro- 
teases previously cloned by us (4, 14) and identified a human EST 
(GenBank™, accession number AA482392 encoded by IMAGE clone 
752797). The EST predicted a polypeptide with a similarity to the 
carboxyl half of cognate ADAMTS members but with no identities in 
GenBank™ or other protein and nucleotide data bases.

Using nested oligonucleotide primers based on the sequences at the 
5' and 3' ends of the IMAGE clone insert and human skeletal muscle 
cDNA (Marathon cDNA, CLONTECH, Palo Alto, CA) as the template, 
we performed RACE and extended the cDNA at 5' and 3' ends by PCR 
essentially as described previously (4, 14). 

Northern Blot Analysis — Multiple tissue Northern blots from adult 
human and mouse tissues (CLONTECH, Palo Alto, CA) were hybridized
to a [α-32P]dCTP-labeled punctin probe, a 1200-bp cDNA fragment
from the 5' end of the punctin coding sequence, followed by
autoradiographic exposure for 7 days.

Chromosomal Mapping and Genomic Arrangement — To determine
the chromosomal location of Adamtsl1, we analyzed a panel of DNA
samples from an interspecific cross that has been characterized for over
1200 genetic markers throughout the mouse genome (15). Markers can
be seen on the worldwide web
(www.informatics.jax.org/searches/crossdata_form.shtml)
by entering "DNA Mapping Panel Data Sets" from the
mouse genome data base and then selecting the "Seldin cross" and
"Chromosome." Initially, DNA from the two parental mice, C3H/HeJ-gld
and (C3H/HeJ-gld × Mus spretus)F1, was digested with various
restriction endonucleases and hybridized with the Adamtsl1 cDNA
probe (IMAGE clone 2076907 with GenBank™ accession number
AI787975) to determine restriction fragment length variants for
haplotype analyses. Gene linkage was determined by segregation analysis.
Gene order was determined by analyzing all haplotypes and minimizing
crossover frequency among all genes that were determined to be within
a linkage group. This method resulted in the determination of the most
probable gene order. To define the locus for ADAMTSL1, the human
punctin cDNA sequence was used for BLAST searches of the human
genome (Celera Sciences, Rockville, MD).
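
The gene-ordering step described above (choose the order that minimizes crossovers across all backcross haplotypes) can be illustrated with a minimal sketch. The locus names "A", "B", "C" and the four haplotypes are hypothetical toy data, not values from the mapping panel, and the brute-force search over permutations is an assumption made for illustration rather than the authors' actual software.

```python
from itertools import permutations

def count_crossovers(order, haplotypes):
    """Total number of genotype switches between adjacent loci, summed over animals."""
    total = 0
    for hap in haplotypes:
        alleles = [hap[locus] for locus in order]
        total += sum(a != b for a, b in zip(alleles, alleles[1:]))
    return total

def best_order(loci, haplotypes):
    """Return the locus order with the fewest crossovers (an order equals its reverse)."""
    candidates = {min(p, p[::-1]) for p in permutations(loci)}
    return min(sorted(candidates), key=lambda o: count_crossovers(o, haplotypes))

# Hypothetical backcross animals: "C" = homozygous C3H pattern, "S" = F1 pattern.
toy_haplotypes = [
    {"A": "C", "B": "C", "C": "C"},
    {"A": "S", "B": "S", "C": "S"},
    {"A": "C", "B": "C", "C": "S"},  # one crossover between B and C
    {"A": "S", "B": "C", "C": "C"},  # one crossover between A and B
]

order = best_order(["A", "B", "C"], toy_haplotypes)
print(order, count_crossovers(order, toy_haplotypes))
```

With real panel data, the distance between adjacent loci would then be estimated from the fraction of recombinant animals; that step is omitted here.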

Generation and Characterization of Anti-punctin Antisera — The peptide
(NH2)-[C]YYPENIKPKPKLQE-(OH) located in the third TS domain
of punctin (Fig. 1b) was synthesized using Fmoc
(N-(9-fluorenyl)methoxycarbonyl) chemistry, purified by reverse-phase
high-pressure liquid chromatography, and molecular weight was confirmed by MS
(Alpha Diagnostic International, San Antonio, TX). A cysteine ([C])
residue was included at the NH2 terminus for coupling to keyhole
limpet hemocyanin. Peptide-keyhole limpet hemocyanin conjugate was
dialyzed in PBS and used for immunization. Two New Zealand White
male rabbits (7-8 pounds) were immunized with the conjugate (~200
µg/injection/rabbit, multiple intramuscular and subcutaneous sites) at
biweekly intervals for 8 weeks. After an initial injection in Freund's
complete adjuvant, subsequent injections were given in incomplete
adjuvant. Antibody titer was measured by enzyme-linked immunosorbent
assay using free peptide.

Immune sera were tested by Western blot analysis of extracts from 
COS-1 cells transiently transfected with punctin cDNA (see below). 
Although antisera from both rabbits (antisera 4112 and 4113) gave 
qualitatively similar results, the best signal/noise ratio was obtained 
with antiserum 4113. Affinity-purified antibodies were prepared by 
column chromatography of antiserum 4113 using the immobilized pep- 
tide immunogen. 

Expression and Purification of Recombinant Punctin from Insect
Cells — High-Five™ cells (Invitrogen) were routinely cultured on tissue
culture plastic and maintained at 27 °C in Ultimate™ serum-free insect
cell medium (Invitrogen) as per manufacturer's directions. The
full-length punctin ORF was excised from pcDNA3.1/Myc-His B-TSL1
(see below) with EcoRI and NotI and ligated into the corresponding sites
in pIZT/V5-His (Invitrogen). The resulting insect cell expression plasmid
pIZT/V5-His-TSL1 generated punctin with a COOH-terminal V5
epitope and 6× His tag. pIZT/V5-His-TSL1 was transfected into High-Five™
cells using Insectin-Plus liposomes (Invitrogen) and plated onto
100-mm Petri dishes. After 48 h, antibiotic selection (500 µg/ml Zeocin,
Invitrogen) was started and continued for 21 days. Colonies that survived
selection were picked manually, expanded, and maintained in
medium containing Zeocin (50 µg/ml). Punctin production by isolated
colonies was tested by Western blot analysis of conditioned medium
using anti-His monoclonal antibody (Invitrogen) and antibody 4113.
For protein production, cells were grown in suspension in either
Ultimate™ serum-free insect cell medium or Express-Five serum-free
medium containing heparin (5 units/ml, Invitrogen). Production cultures
were in spinner flasks, and culture medium was stored at -80 °C
with 1 mM phenylmethylsulfonyl fluoride until use. For purification,
medium was dialyzed into binding buffer (20 mM sodium phosphate,
500 mM NaCl, pH 7.8) containing 0.03% Brij-35 (Sigma). Purification
was performed using 1-liter batches of dialyzed medium and a 5-ml
Ni-Sepharose column (ProBond™, Invitrogen) on a fast protein liquid
chromatography instrument (Bio-Rad, Hercules, CA). Following binding,
the column was washed with three column volumes of binding
buffer. A gradient of 0-42.5 mM imidazole in binding buffer was used to
remove nonspecifically bound molecules from the column. Elution was
with four column volumes of 250 mM imidazole in binding buffer, pH
7.0, containing 0.03% Brij-35. Elution was monitored by in-line UV and
conductivity measurements. 2-ml fractions of eluate were collected and
tested by Western blot analysis as described above. Fractions containing
punctin were pooled. Protein concentration was determined using
the Bradford assay (Bio-Rad) and by phenylthiocarbamyl amino acid
analysis using an Applied Biosystems model 420H/130/920 automated
analysis system (16).

Characterization of Recombinant Punctin — The NH2-terminal sequence
of recombinant punctin was determined by Edman degradation.
Recombinant punctin (5 µg) was electrophoresed on 10% SDS-PAGE,
electrotransferred to polyvinylidene difluoride membrane, and lightly
stained with modified Coomassie Blue (Simply Blue Safe Stain, Invitrogen).
Protein bands were excised and subjected to Edman degradation
on an Applied Biosystems Procise 492 sequencer in the Molecular
Biotechnology Core Facility of the Lerner Research Institute.

To probe for glycosylation, recombinant punctin (4 µg) was electrophoresed
on 10% SDS-PAGE and stained for carbohydrate using a
periodic acid-Schiff reaction-based method (Pro-Q fuchsia glycoprotein
staining kit, Molecular Probes, Eugene, OR). In this reaction,
CandyCane™ glycoprotein molecular weight standards consisting of alternate
bands of glycosylated and unglycosylated proteins were used as
controls. Glycoprotein staining was also performed after enzymatic
deglycosylation of punctin with peptide N-glycosidase F. Deglycosylation
of denatured as well as native punctin was performed with a
commercially available kit (Bio-Rad) using bovine fetuin as a control. To
investigate further whether N-linked carbohydrates were present in
punctin, stably transfected insect cells were cultured in the presence or
absence of tunicamycin A1 homolog (0.1 µg/ml culture medium, Sigma).
Equal amounts of total protein from culture medium of tunicamycin-treated
and untreated cells were assayed by Western blot with antibody
4113 at various time points after the addition of tunicamycin.

Mass Spectrometry — The molecular mass of punctin was measured
by MALDI-TOF and by LC-ESMS. MALDI-TOF was performed with a
PerkinElmer Biosystems Voyager DE Pro mass spectrometer using
sinapinic acid as the matrix and bovine serum albumin as a calibration
standard protein (17). MALDI-TOF MS measurements of intact punctin
and naturally observed limited proteolysis fragments are reported ±
50% peak width (in Da) at half-maximal peak height. LC-ESMS was
performed with a PerkinElmer Sciex API 3000 triple quadrupole mass
spectrometer (17, 18). Nitrogen was used as the nebulization gas at
40 p.s.i., and curtain gas was supplied from a nitrogen generator
(Whatman model 75-72). For LC-ESMS of intact punctin, a scan range of
700-1800 m/z was used with 0.2 atomic mass unit steps, a scan time of
7.5 s, an orifice potential of 80 V, and an ion spray voltage of 5000 V.
Reverse phase-high-pressure liquid chromatography was done at a flow rate of
5 µl/min on a 5-µm Vydac C18 capillary column (0.3 × 150 mm, LC
Packing) using an Applied Biosystems Model 140D high-pressure liquid
chromatography system and aqueous acetonitrile/trifluoroacetic acid
solvents with 100% of the eluant going to the mass spectrometer. ESMS
measurements of intact punctin are reported as the mean ± S.E.
(in Da).
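
As a rough illustration of why a 700-1800 m/z scan range suits a protein of about 62 kDa, the sketch below lists the positive-ion charge states whose m/z would fall inside that window. It assumes the standard electrospray relation m/z = (M + z × 1.00728)/z for a protein carrying z protons; the mass of 61,970 Da is one of the measured species reported in the text, while the function names are illustrative only.

```python
import math

PROTON = 1.00728  # mass of a proton in Da

def mz(mass, z):
    """m/z of a protein of neutral mass `mass` carrying z protons."""
    return (mass + z * PROTON) / z

def charge_states_in_window(mass, lo, hi):
    """Charge states whose m/z lies within [lo, hi]."""
    z_min = math.ceil(mass / (hi - PROTON))
    z_max = math.floor(mass / (lo - PROTON))
    return range(z_min, z_max + 1)

states = charge_states_in_window(61970.0, 700.0, 1800.0)
print(states[0], states[-1])                      # lowest and highest observable charge
print(round(mz(61970.0, states[0]), 2), round(mz(61970.0, states[-1]), 2))
```

In practice the intact mass is recovered by deconvoluting the whole charge-state envelope; that step is not shown here.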

For glycopeptide characterization, punctin was excised from an
SDS-polyacrylamide gel (~1 µg/lane × 6 lanes), in-gel reduced with 10 mM
dithiothreitol, cysteine-alkylated with 20 mM iodoacetamide in 400 mM
ammonium bicarbonate, and digested with 0.2 µg of trypsin (Promega)
overnight at 37 °C in 100 mM ammonium bicarbonate. Peptides from
the in-gel tryptic digests were extracted with 60% acetonitrile containing
0.1% trifluoroacetic acid, dried in a Speed Vac, redissolved in 50 µl
of 0.1% trifluoroacetic acid, and analyzed by LC-ESMS using selective
ion monitoring with the PE Sciex API 3000 triple quadrupole mass
spectrometer system as described above for intact protein analyses.
Glycopeptides were selectively detected based on diagnostic sugar oxonium
ions HexNAc + Hex (m/z 366) and N-acetylneuraminic acid
(NeuAc) (m/z 292) (17). Carbohydrate marker ions at m/z 366 and 292
(dwell time 200 ms each) were monitored in a positive ion mode at a
high orifice potential (180 V), whereas full scans at m/z 300-2300 (0.2
atomic mass unit steps, scan time 3.5 s) were acquired at a lower orifice
potential (70 V). This way both intact parent ions and abundant marker
ions were observed in the same m/z scan.
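
The two marker ions monitored above can be checked with a short calculation. This is a sketch that assumes singly protonated oxonium ions and standard monoisotopic residue masses for the sugars (values supplied here, not taken from the paper).

```python
PROTON = 1.00728  # Da

# Monoisotopic residue masses (sugar minus water), assumed standard values.
RESIDUE_MASS = {
    "HexNAc": 203.0794,
    "Hex": 162.0528,
    "NeuAc": 291.0954,
}

def oxonium_mz(*residues):
    """m/z of a singly charged oxonium ion built from the named sugar residues."""
    return sum(RESIDUE_MASS[r] for r in residues) + PROTON

hexnac_hex = oxonium_mz("HexNAc", "Hex")
neuac = oxonium_mz("NeuAc")
print(round(hexnac_hex, 3), round(neuac, 3))
```

Rounded to the nearest integer, these correspond to the m/z 366 and m/z 292 channels named in the text.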

Rotary Shadowing and Electron Microscopy of Recombinant Punc- 
tin — Rotary shadowing was done essentially as described previously 
(19). A 30-µl sample of punctin at 100 µg/ml was mixed with 70 µl of
glycerol and nebulized onto freshly cleaved mica using an airbrush. The 
sample was dried in a vacuum, and rotary shadowed using a platinum- 
carbon electron beam gun angled at 6° relative to the mica surface 
within a Balzers BAE 250 evaporator. The replica was backed with 
carbon, floated onto distilled water, and picked up onto 600 mesh grids. 
Photomicrographs were taken using a Philips 410 electron microscope 
operated at 80 kV. 

Transient Expression of Tagged and Untagged Punctin in COS-1
Cells — An internal SacI site and a flanking NotI site were used to
remove a 1.5-kb fragment of IMAGE clone 752797 and ligate it into
corresponding sites in IMAGE clone 2150669 corresponding to the 5'
end of the punctin cDNA to generate a complete ORF. EcoRI and NotI
sites flanking this ORF were used to excise and clone the full-length
coding sequence into pcDNA3.1/Myc-His (+) A (Invitrogen) for the
expression of untagged punctin. To make constructs in which the
ADAMTSL1 ORF was in-frame with a carboxyl-terminal FLAG tag or a
tandem myc tag and 6× His tag, PCR was performed with Advantage 2
polymerase (CLONTECH, Palo Alto, CA) using the full-length coding
sequence as a template. The amplicons were cloned into the vectors
pFLAG-CMV5c (Sigma) and pcDNA3.1/Myc-His B (Invitrogen) for
expression with either a COOH-terminal FLAG tag or a COOH-terminal
tandem myc tag and 6× His tag, respectively.

COS-1 cells (ATCC number CRL-1650) were grown on tissue culture
plastic in Dulbecco's modified Eagle's medium:F-12 (1:1) (Lerner Research
Institute Media Services) supplemented with 10% fetal bovine
serum (Invitrogen) and antibiotics (100 units/ml of penicillin and 50
µg/ml streptomycin). 10^5 cells between passages 3 and 10 were transfected
with untagged, FLAG-tagged, or myc + 6× His-tagged punctin
using FuGENE 6 (Roche Molecular Biochemicals) as per manufacturer's
recommendations, and cells were grown for an additional 24-48 h
in serum-supplemented or serum-free medium. As a control, cells were
transfected with the respective vector alone without insert. The medium
was collected and concentrated 10-fold. Cells were harvested after
detachment with 10 mM EDTA for 10-15 min at 37 °C. A complete
detachment of cells was confirmed by phase-contrast microscopy. Fifty
microliters of 2× Laemmli sample buffer was added to the wells, and
the ECM was scraped off. Samples of cell lysate, medium, and ECM
were separately electrophoresed under reducing conditions (samples
were boiled following the addition of 10% (v/v) 2-mercaptoethanol) on
12% SDS-polyacrylamide gels and transferred to enhanced chemiluminescence
(ECL)-Hybond (Amersham Biosciences, Inc.). Western blotting
was performed using either anti-FLAG M2 antibody (diluted 1:500,
Sigma), anti-His (COOH-terminal) antibody (diluted 1:1000, Invitrogen)
or antibody 4113 (diluted 1:300) depending on the construct used
for transfection. Antibody binding was detected using the appropriate
peroxidase-labeled second antibody followed by ECL using reagents
from Amersham Biosciences, Inc.

For immunocytochemistry, COS-1 cells were grown on glass cover- 
slips in 35-mm diameter wells (in 6-well plates) and transiently trans- 
fected as described above in serum-supplemented or serum-free me- 
dium. The medium was removed 48 h after transfections. The cells were 
washed three times on ice with cold PBS containing 1 mM CaCl2 and 1
mM MgCl2 and incubated for 1 h on ice with 1 ml of culture medium
containing anti-FLAG M2 monoclonal antibody (diluted 1:300, Sigma) 
or anti-punctin rabbit antisera (diluted 1:100) with gentle shaking. 
Cells were washed four times for 3 min each with cold PBS, fixed in 4% 
paraformaldehyde (w/v in PBS) (Sigma) on ice for 30 min with gentle 
shaking and then washed three times with PBS at ambient tempera- 
ture. To quench free aldehyde groups, cells were treated with 75 mM 
ammonium chloride, 20 mM glycine for 10 min at ambient temperature, 
washed with PBS, and then blocked with 0.05% Triton X-100, 2% 
normal goat serum in PBS (10 min at ambient temperature). Finally, 
sections were incubated with the species-appropriate Texas Red-labeled 
goat secondary antibody (Jackson ImmunoResearch Laboratories, West
Grove, PA) prior to coverslip mounting in Vectashield containing
4',6-diamidino-2-phenylindole (Vector Laboratories, Inc., Burlingame, CA).
The following control-immunostaining experiments were performed. 
COS-1 cells transfected with the vector alone or untransfected COS-1 
cells were stained with the above antibodies, or transfected cells were 
stained with preimmune serum from the rabbits in which the polyclonal 
antibodies were produced. 

To co-stain punctin and the actin cytoskeleton, cells were stained
with anti-FLAG or anti-punctin antibodies as described above with the 
exception that the secondary antibodies included incubation with Alexa 
488-phalloidin at recommended dilutions (Molecular Probes). In double 
immunostaining experiments following the immunolocalization of
FLAG or punctin as described above, cells were permeabilized with 
0.1% Triton X-100 in PBS for 20 min prior to staining with (a) mono- 
clonal antibody to vinculin (1:100 dilution, Sigma) in combination with 
antiserum 4113 for the detection of punctin or (b) polyclonal antibody to 
focal adhesion kinase (1:200 dilution, Upstate Biotechnology, Lake 
Placid, NY) in combination with anti-FLAG monoclonal antibody M2 
(Sigma) for the detection of punctin. A Texas Red-labeled antibody 
(Jackson ImmunoResearch Laboratories) was used for the detection of 
punctin, and Alexa 488-conjugated antibody (Molecular Probes) was 
used for the detection of vinculin or focal adhesion kinase. 

RESULTS 

Cloning of Punctin cDNA — We identified a novel EST (GenBank™
accession number AA482392) derived from pooled human
melanocyte, fetal heart, and pregnant uterus with homology
to ADAMTS proteases. The 1.5-kb insert of the
corresponding IMAGE clone 752797 contained a long ORF encoding
an amino-terminal TS domain, a cysteine-rich domain, a
cysteine-free spacer domain, and three tandem TS modules
followed by a short acidic peptide and stop codon (Fig. 1a). The
stop codon and 3'-untranslated sequence were independently
confirmed by 3'-RACE (clone pSHTSL1s3, Fig. 1a) as well as by
another EST (GenBank™ accession number W47029). The
3'-untranslated region encoded in IMAGE clone 752797 contained
a consensus polyadenylation signal (AATTAAA) followed
by a poly(A) tail 14 nucleotides downstream. Completion
of the full-length coding sequences by 5'-RACE predicted a
putative signal peptide upstream of the central TS domain. The
signal peptide was preceded by a methionine codon within a
satisfactory Kozak consensus sequence (A at -3, G at +4
relative to ATG) (20) although there was no upstream in-frame
stop codon. The 5' sequence obtained by RACE was subsequently
validated by independently cloned human and mouse
ESTs (GenBank™ accession numbers AI459225 for human
EST and AK020115 for mouse EST). The continuity of the
cDNA clones was confirmed by PCR amplification of the full-length
punctin ORF from human skeletal muscle cDNA (see
below) as well as by identification of the encoding exons arranged
sequentially on human chromosome 9 (Celera Genomics,
Rockville, MD).
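
The spacing between the polyadenylation signal and the poly(A) tail described above can be checked against the sequence data. The sketch below assumes that the last 70 nucleotides of the ADAMTSL1 ORIGIN listing reproduced earlier in this exhibit (positions 1741-1810, as transcribed) correspond to the 3' end of the punctin cDNA; the variable names are illustrative.

```python
# Positions 1741-1810 of the ORIGIN listing, in 10-nucleotide blocks as printed.
tail = "".join([
    "atcatctcac", "caaagctttt", "tggctctcaa", "attaaagatt",
    "gattagtttc", "aaaaaaaaaa", "aaaaaaaaaa",
])

SIGNAL = "aattaaa"
signal_start = tail.find(SIGNAL)            # 0-based index within `tail`
signal_end = signal_start + len(SIGNAL)     # first base after the signal

# The poly(A) tail is taken as the uninterrupted run of 'a' at the 3' end.
poly_a_start = len(tail.rstrip("a"))
gap = poly_a_start - signal_end

print("signal at position", 1741 + signal_start)
print("poly(A) length", len(tail) - poly_a_start)
print("nucleotides between signal and poly(A):", gap)
```

Under these assumptions the spacer between the signal and the poly(A) run is 14 nucleotides, in line with the description in the text.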

Primary Structure of Punctin Predicts an ADAMTS-like Protein —
The predicted full-length punctin protein contains 525
amino acids and has the typical domain structure of the ancillary
noncatalytic regions of an ADAMTS protease (Fig. 1a). The
mature secreted form of punctin is 497 amino acids with a
molecular mass of 55,240 Da and a calculated pI of 6.2. Like the
ADAMTS proteases, each domain in punctin has an even number
of cysteine residues. This observation suggests that each
domain may have internal disulfide bonds (17 such bonds are
predicted in punctin), and that punctin consists of a series of
independently-folded and disulfide-bonded domains. Punctin
contains no other domains apart from those described previously
in the ADAMTS family. The punctin sequence contains
one motif for N-linked glycosylation (21) at Asn223 (-Asn-X-Ser/Thr-,
where X is any amino acid except Pro) and also contains
a total of 75 Thr and Ser residues, where O-linked glycosylation
might occur (Fig. 1b).
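
The N-linked glycosylation motif described above (Asn-X-Ser/Thr with X not Pro) is straightforward to scan for in a protein sequence. The following is a minimal sketch; the peptide used is a made-up toy sequence, not the punctin sequence, and the helper names are illustrative.

```python
import re

# Lookahead so that overlapping motifs are also reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def find_sequons(protein):
    """Return (1-based position of Asn, motif) for every Asn-X-Ser/Thr, X != Pro."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein)]

def count_ser_thr(protein):
    """Number of Ser and Thr residues (candidate O-linked sites)."""
    return sum(protein.count(aa) for aa in "ST")

toy_peptide = "MKNGSAPNPTQNLTW"   # hypothetical sequence for illustration
print(find_sequons(toy_peptide))   # Asn-Pro-Thr is skipped because X is Pro
print(count_ser_thr(toy_peptide))
```

A motif match only indicates a potential site; whether it is occupied has to be shown experimentally, as done here with tunicamycin and peptide N-glycosidase F.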

The overall punctin sequence is most similar to human ADAMTSL-3
(68% identity, see below). Of the ADAMTS enzymes
published to date, punctin is most similar to human ADAMTS-10
(35% identity). The punctin TS domains have a
higher degree of similarity to other ADAMTS-like proteins and
ADAMTS proteases than to TSP1 and TSP2. The greatest
similarities, as indicated by percentage of identity of amino
acid sequences identified by BLAST searches of the first TS
domain of punctin to TS domains from various molecules, are
as follows: human ADAMTSL-3, 80%; human ADAMTS-1, ADAMTS-6,
and ADAMTS-10, 50%; mouse papilin, 47%; human
ADAMTS-8, 44%; human ADAMTS-5, 42%; human TSP2, 40%;
human TSP1, 38%. Like most TS domains in the ADAMTS
family, punctin TS domains do not contain linear peptide sequences
found in TSP1 that have been defined as heparin or
CD-36 binding sequences (22). They do not contain degenerate
GAG binding sequences such as BRXB, where B is a basic
amino acid and X is any amino acid (22).

Fig. 1. a, domain organization of punctin/ADAMTSL-1 shown relative
to ADAMTS-1, the prototypic ADAMTS. The cloning strategy used
for determination of the complete primary structure is shown. The
location of each cDNA clone relative to the protein domains indicates
the regions it encodes. The key to the domains is shown at the bottom
of the figure. b, the predicted amino acid sequence of punctin is shown
using the single-letter amino acid code. TS modules are underlined with
the thick line and are numbered sequentially from amino to carboxyl
terminus. A consensus sequence for N-linked glycosylation is overlined.
Cysteine residues are indicated by asterisks. The start of the spacer
domain is indicated; the region between the NH2-terminal TS domain
and the spacer domain is the cysteine-rich domain. The dashed line
indicates the peptide used for the generation of antibodies. The arrow
indicates the signal peptidase cleavage site. The arrowhead indicates a
putative proteolytic processing site between TS domains 2 and 3. c,
segregation of Adamtsl1 on mouse chromosome 4 in ((C3H/HeJ-gld ×
M. spretus)F1 × C3H/HeJ-gld) interspecific backcross mice. Filled
boxes represent the homozygous C3H pattern, and open boxes represent
the F1 pattern. The mapping of the reference loci in this interspecific
cross has been previously described (15).

Genomic Location of the Mouse and Human Punctin Genes
and Tissue-specific Expression — The mapping of Adamtsl1 in
an interspecific cross resulted in the following most probable
gene order (mean ± S.D.): Ptprd - 4.4 ± 2.0 centimorgan - Adamtsl1,
Cdkn2a - 1.8 ± 1.2 centimorgan - Jun, and placed Adamtsl1
at a consensus position of 42.6 centimorgan on mouse
chromosome 4 (Fig. 1c) in the vicinity of the interferon gene
cluster. A search of the mouse genome data base
(www.informatics.jax.org) did not reveal any pertinent genetic disorders
near this locus.

The human-mouse homology maps (www3.ncbi.nlm.nih.gov/Omim/Homology/,
accessed September 26, 2001) predict that
the ADAMTSL1 locus is on human chromosome 9p21-22. The
predicted locus was confirmed by the analysis of the human
genome sequence. The punctin ORF is encoded by 13 exons
spanning >250 kb of genomic DNA mapping to 9p21.2-22.1. A
search of the Online Mendelian Inheritance in Man site
(www3.ncbi.nlm.nih.gov/Omim/) revealed three unsolved human
disorders in the vicinity of the ADAMTSL1 locus. Diaphyseal
medullary stenosis with malignant fibrous histiocytoma
(MIM 112250) is linked to 9p22-p21, Friedreich's ataxia 2
(MIM 601992) is linked to 9p23-p11, and neuropathy, distal
hereditary motor, Jerash type (MIM 605726) is linked to
9p21.1-p12.

ADAMTSL1 is primarily expressed in human and mouse
skeletal muscle with a major message size of ~7.0 kb in both
species (Fig. 2). A minor messenger RNA species of ~1.0 kb was
also seen in some human tissues (Fig. 2, skeletal muscle, heart,
colon, kidney, and liver). Expression was not detected in brain,
colon, thymus, spleen, placenta, small intestine, lung, testis,
ovary, or peripheral blood leukocytes.

Expression and Characterization of Recombinant Punctin —
Punctin expressed in High-Five™ cells with tandem COOH-terminal
V5 and 6× His epitopes was secreted into the conditioned
medium of adherent as well as suspension cultures.
Punctin was detected by antibody 4113 and anti-epitope tag
antibodies as a ~60-kDa band under reducing conditions. It
was substantially purified from the culture medium using
Ni-chromatography (Fig. 3a). The purification scheme yielded a
maximum of 200 µg/liter purified protein as determined by
amino acid analysis. Electrophoresis and Western blotting of
concentrated punctin preparations frequently demonstrated
additional bands of molecular mass (~120 and ~180 kDa, data
not shown), suggesting the formation of dimers and trimers at
high concentrations.

The conformation of punctin appears to be maintained by
disulfide bonds as evidenced by more rapid migration in SDS-PAGE
under nonreducing conditions than under reducing conditions
(Fig. 3b). Furthermore, on Western blots under nonreducing
conditions, the protein was not detectable with antibody
4113 (data not shown), suggesting that the peptide epitope was
not accessible without reduction of disulfide bonds. A mass
analysis of His-tagged punctin by MALDI-TOF MS yielded a
broad peak suggesting that the 60-kDa gel band contained
major molecular species of 61,935 ± 595 and 60,873 ± 295 Da,
respectively. LC-ESMS analyses of the intact protein defined
more precisely the major molecular species to be 61,970 ± 6
and 62,131 ± 5, which are, respectively, 1979 and 2140 Da
larger than the calculated mass (59,991) of tagged punctin
based on amino acid sequence. NH2-terminal sequencing of the
polyvinylidene difluoride-immobilized 60-kDa protein revealed
a single sequence, which commenced at Glu29 (i.e.
Glu-Glu-Asp-Arg-Asp-Gly and so on).
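
The position of the mature NH2 terminus can be cross-checked against the nucleotide sequence. The sketch below translates the first 180 nucleotides of the ADAMTSL1 ORIGIN listing reproduced earlier in this exhibit (as transcribed), assuming that translation starts at the first ATG and using the standard genetic code; the variable and function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Nucleotides 1-180 of the ORIGIN listing (three printed rows of 60).
seq = "".join([
    "gcaggcagag", "gagcacttag", "cagcttattc", "agtgtccgat", "tctgattccg", "gcaaggatcc",
    "aagcatggaa", "tgctgccgtc", "gggcaactcc", "tggcacactg", "ctcctctttc", "tggctttcct",
    "gctcctgagt", "tccaggaccg", "cacgctccga", "ggaggaccgg", "gacggcctat", "gggatgcctg",
])

def translate(dna, start):
    """Translate from `start` (0-based) until a stop codon or the end of `dna`."""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

start = seq.find("atg")           # first ATG, assumed to be the initiator codon
protein = translate(seq, start)
print("ATG at nucleotide", start + 1)
print("residues 1-28:", protein[:28])
print("residues 29-34:", protein[28:34])
```

Under these assumptions residues 29-34 read EEDRDG, matching the Edman sequence reported for the secreted protein, so the first 28 residues would correspond to the cleaved signal peptide.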

Recombinant Punctin Is Glycosylated — Two closely spaced
punctin bands were resolved by Western blot analysis of conditioned
medium or purified protein, although Coomassie Blue
staining of purified punctin always demonstrated a single band
(Fig. 3a). A periodic acid-Schiff-based method of staining carbohydrate
chains suggested that recombinant punctin is a gly-

FIG. 3. Analysis of epitope-tagged punctin purified by Ni-chromatography
from insect cell culture medium. a, Coomassie Blue
(Simply Blue Safe Stain) staining of purified recombinant punctin on
reducing SDS-PAGE (left lane) and Western blot analysis with
anti-punctin antibody 4113 (right lane). b, Western blot analysis using
anti-His tag monoclonal antibody on reducing (left lane) and nonreducing
SDS-PAGE (right lane). c, glycoprotein staining of recombinant
punctin (lane 2 contains 0.6 µg, and lane 3 contains 3 µg) using the
periodic acid-Schiff procedure. Glycosylated CandyCane™ markers (1
µg/band) stained similarly are in lane 1. The arrow indicates stained
punctin. d, Western analysis of culture medium from insect cell cultures
treated without (left lane) or with (right lane) tunicamycin A for 72 h.
Each lane contains 2.8 µg of total protein. Double arrowheads are used
to indicate two molecular species seen on Western blots.



FIG. 2. Northern analysis of expression of ADAMTSL1 (left) and Adamtsl1 (right) in adult human and mouse tissues, respectively. 
Kilobase markers of RNA are shown at the left of each autoradiogram, and tissue origin is indicated above each lane. Hybridizing transcripts are 
indicated by arrows. 

characterized fully. Approximately 65% of the amino acid se- 
quence in punctin was identified by peptide mass mapping 
including the NH2-terminal tryptic peptide (Glu29-Arg47), ver- 
ifying that the target protein has been expressed. Based on the 
difference between the observed and calculated masses of in- 
tact punctin, the recombinant protein contains approximately 
3-4% carbohydrate by weight. 
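
The mass arithmetic behind these estimates can be checked with a short sketch. It is illustrative only: the function names and the 2-Da matching tolerance are assumptions introduced here and are not part of the reported analysis. The masses are those quoted in the text.

```python
# In-chain average residue masses (Da) as quoted in the text.
RESIDUE_MASSES = {"Hex": 162.0, "HexNAc": 203.0, "NeuAc": 291.0}

CALCULATED_MASS = 59_991.0               # tagged punctin, from amino acid sequence
OBSERVED_MASSES = (61_970.0, 62_131.0)   # major species by LC-ESMS


def carbohydrate_fraction(observed, calculated=CALCULATED_MASS):
    """Fraction of the observed mass not explained by the polypeptide chain."""
    return (observed - calculated) / observed


def closest_residue(delta, tolerance=2.0):
    """Name of the residue within `tolerance` Da of `delta`, or None.

    The 2-Da tolerance is a hypothetical choice made for this sketch.
    """
    name, mass = min(RESIDUE_MASSES.items(), key=lambda kv: abs(kv[1] - delta))
    return name if abs(mass - delta) <= tolerance else None


# Excess mass over the calculated polypeptide mass for each major species.
excess = [m - CALCULATED_MASS for m in OBSERVED_MASSES]

# The same excess expressed as percent of the observed mass.
percent = [round(100 * carbohydrate_fraction(m), 1) for m in OBSERVED_MASSES]

# Mass differences between the two intact species and the two glycopeptides.
intact_delta = OBSERVED_MASSES[1] - OBSERVED_MASSES[0]
glycopeptide_delta = round(6171.2 - 5881.4, 1)

print(excess, percent)
print(intact_delta, closest_residue(intact_delta))
print(glycopeptide_delta, closest_residue(glycopeptide_delta))
```

Under these assumptions the excess masses are 1979 and 2140 Da (about 3.2% and 3.4% of the observed mass), the 161-Da difference between the intact species matches a hexose, and the 289.8-Da difference between the glycopeptides matches NeuAc.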

During purification of punctin in the absence of protease 
inhibitors, additional components of ~40 and ~20 kDa, respec- 
tively, were detected on Coomassie Blue-stained gels (data not 
shown). The 40-kDa band contained two molecular species with 
measured masses of 38,409 ± 115 and 39,456 ± 156 Da, re- 
spectively, as determined by MALDI-TOF MS. The NH2-termi- 
nal sequencing of these bands yielded the same amino termi- 
nus as the full-length punctin. The ~20-kDa fragment 
exhibited an NH2-terminal sequence 372DLYHPL, indicating 
that the fragment is from the carboxyl terminus. The addition 
of 1 mM phenylmethylsulfonyl fluoride to culture medium ef- 
fectively prevented this proteolysis, suggesting that it was ef- 
fected by a serine protease. 

Visualization of Punctin by Rotary Shadowing — Rotary 
shadowing of purified recombinant punctin demonstrated a 
hatchet-shaped or comma-shaped molecule 30-40 nm in 
length (Fig. 4). Punctin consists of a single globular domain of 
10-20 nm in size with a short linear segment at one end. Most 
of the visualized protein was in monomeric form (Fig. 4). Oc- 
casional aggregates with the appearance of dimers and trimers 
were seen but have not yet been resolved in detail. 

Expression and Localization of Punctin in Transfected COS-1 
Cells — Transfected cells were stained without fixation or per- 
meabilization and on ice (live staining) to prevent the detection 
of intracellular punctin or endocytosed antibody, respectively. 
Under these conditions, punctin was localized underneath the 
cells (i.e. adjacent to their ventral surface) in the substratum 
laid down on plastic. The staining pattern was punctate (Fig. 5, 
a-d) and was preferentially located toward the periphery of the 
cells (Fig. 5, a, b, and d) and under cellular processes (Fig. 5c). 
The punctin deposits were of submicron dimension, although 
fluorescent signals from closely located deposits were fre- 
quently merged suggesting larger aggregates. Transfected cells 
had minimal or no staining on the dorsal cell surface. Punctin 
was not seen in the substratum in areas not corresponding to 
the cells. If cells were detached with 10 mM EDTA prior to 
staining, "footprints" of transfected cells were retained on the 
substratum with a similar staining pattern as under intact 



coprotein (Fig. 3c), and mass spectrometry demonstrated mul- 
tiple molecular species consistent with variable glycosylation. 
Treatment of recombinant protein with peptide N-glycosidase 
F did not result in a perceptible decrease in molecular mass, 
although the intensity of glycoprotein staining was decreased 
(data not shown). Culture medium from tunicamycin-treated 
cells exhibited only a single punctin species as demonstrated by 
Western blotting (Fig. 3d). The difference (161 Da) between the 
LC-ESMS-observed masses of the major punctin molecular spe- 
cies (61,970 and 62,131 Da) is close to the in-chain chemical 
average mass of an oligosaccharide residue (Hex, 162). Minor 
molecular species were also apparent by LC-ESMS analysis, 
which differed by mass increments that approximated the in- 
chain chemical average mass of oligosaccharide residues (e.g. 
Hex, 162; HexNAc, 203; NeuAc, 291). For a further analysis, 
tryptic digests of the protein were examined by analytical LC- 
ESMS using stepped collision energy scanning to produce car- 
bohydrate-specific marker ions. Glycopeptides were detected 
including molecular species with masses of 5881.4 ± 0.4 and 
6171.2 ± 0.2 Da. The mass difference (289.8 Da) between these 
observed glycopeptides appears to correspond to the in-chain 
chemical average mass of N-acetylneuraminic acid (NeuAc, 
291). Taken together, these data indicated that punctin is 
glycosylated, although specific glycopeptides have yet to be 

FIG. 4. Rotary shadowing of recombinant punctin. a, overview. 
b-g, images of individual punctin molecules. Scale bar in panel a 
indicates molecular dimensions in all panels. 



cells. Staining was seen in some areas not covered with cell 
processes. In other areas, there were cell processes without 
underlying punctin (Fig. 5c). We interpret this finding to result 
from cellular motility (i.e. withdrawal of existing processes and 
the formation of new ones). Identical results were obtained 
with anti-FLAG monoclonal antibody or antibody 4113. Fig. 5, 
a-c, shows staining of FLAG-tagged protein using the FLAG 
M2 monoclonal antibody, and Fig. 5d shows staining with 
anti-punctin antiserum 4113. Similar staining patterns were 
seen whether cells were grown in the presence or absence of 
serum and using tagged or untagged proteins (data not shown). 

Double staining for vinculin (Fig. 5d) or focal adhesion ki- 
nase (data not shown), components of focal contacts, indicated 
that punctin staining did not correspond to sites of focal con- 
tacts. No staining was visible in control experiments, i.e. in 
untransfected COS cells, cells transfected with vector alone, 
cells stained without a primary antibody, or cells stained with 
preimmune serum as control. 

On Western blots, we found reactive protein bands of the 
expected size (58-60 kDa for untagged punctin and 62-64 kDa 
for the His-tagged or FLAG-tagged forms) in the medium, cell 
layer, and the underlying substratum or ECM of transfected 
COS-1 cells (Fig. 5e). In contrast, cells transfected with vector 
alone (Fig. 5e) or untransfected cells (data not shown) did not 
show a reactive band. As controls, preimmune serum from the 
rabbits in which anti-Punctin antibodies were generated did 
not produce immunoreactivity on Western blots (data not 
shown). 

DISCUSSION 

Punctin/ADAMTSL-1 Is a Novel ADAMTS-like Secreted 
Protein Belonging to a Distinct ADAMTSL Family of Pro- 
teins — In addition to missing the catalytic domain, the AD- 
AMTS-like proteins (see below) do not possess disintegrin-like 
domains. This finding suggests that the disintegrin-like do- 
main and catalytic domain may represent a functionally cou- 
pled protease domain in ADAMTS enzymes. Further evidence 
for this comes from the identification of other proteins with a 
predicted structure similar to punctin. Following the complete 
cloning of punctin/ADAMTSL-1, we became aware of a second 
such molecule encoded by the KIAA0605 gene (GenBank™ 
accession number AB011177) that we designated as AD- 

FIG. 5. a-d, confocal laser-scanning microscopy of COS-1 cells follow- 
ing transient transfection with ADAMTSL1 expression constructs and 
immunocytochemistry. Untransfected cells are visible in a and b. Scale 
bar (10 µm) is shown at lower right of each panel. a and b, punctate 
staining of FLAG-tagged punctin (red) in nonpermeabilized cells visu- 
alized with anti-FLAG M2 antibody. Nuclei are blue (4',6-diamidino-2- 
phenylindole). c, relationship of punctin staining (red) visualized with 
anti-FLAG M2 monoclonal antibody to cellular actin as visualized by 
phalloidin staining (green). The asterisk indicates a cellular protrusion 
that does not have underlying punctin, and the arrow indicates punctin 
immunolocalization without an overlying cellular process. d, relation- 
ship of punctin (red) visualized with anti-punctin antiserum 4113 to 
vinculin staining (green) as shown by confocal imaging and overlay of 
single-color images from a double-stained cell. e, Western blot analysis 
of cell lysates (lane 1), medium (lane 2), and ECM (lane 3) from trans- 
fected COS-1 cells using an anti-His tag monoclonal antibody. Cell 
lysates from untransfected COS-1 cells are shown in lane 4. Molecular 
mass is indicated on the left. 

AMTSL-2 (23). We have cloned a third ADAMTS-like protein, 
ADAMTSL-3 (GenBank™ accession number AF237652). 3 
Therefore, punctin belongs to a distinct protein family. AD- 
AMTSL-2 and ADAMTSL-3 differ from punctin in their greater 
length (951 and 1690 amino acids, respectively) and also have 
more TS domains (6 and 10, respectively). These molecules will 
be described in greater detail in subsequent publications. In 
contrast to ADAMTSL-2 and ADAMTSL-3, which are quite 
widely expressed, 4 punctin/ADAMTSL-1 is selectively ex- 
pressed in muscle. 

Other secreted ECM molecules such as lacunin and papilin 
also contain the ancillary domains of the ADAMTS family in 
the same order as in punctin. However, punctin is more closely 
related to ADAMTSL-3 and some ADAMTS proteases than it is 
to mouse papilin (32% identity). Lacunin is a basement mem- 
brane glycoprotein in the moth Manduca sexta (24). Lacunin 
has the structure of ADAMTSL including seven TS modules as 



3 N. Moore, B. Anand-Apte, and S. Apte, unpublished data. 

4 S. Apte, unpublished data. 

well as a single COOH-terminal protease and lacunin domain. 
In addition, it contains 13 repeats of a novel lagrin domain, 11 
Kunitz inhibitor domains, 2 antistasin-like domains, 1 serine 
protease inhibitor domain, and 2 immunoglobulin domains. 
Lacunin localizes to the basal lamina of the moth wing (24). 
Papilin from Drosophila melanogaster may be an ortholog of M. 
sexta lacunin, because the two molecules are similar in their 
domain content, organization, and primary sequence. Papilin is 
also a basement membrane protein (25). Although these inver- 
tebrate proteins have numerous protease inhibitor domains, 
mammalian papilin contains substantially fewer such domains 
(25). 

Characterization of Recombinant Punctin from Insect Cells — 
Our experimental data support the likelihood that recombinant 
punctin is disulfide-bonded. First, its electrophoretic mobility 
is greater under nonreducing conditions. Second, the punctin 
epitope is masked under nonreducing conditions. Third, rotary 
shadowing demonstrated a molecule with a specific and con- 
sistent conformation. Limited proteolysis within the linker 
peptide connecting TS domains 2 and 3, assigned to the Tyr371- 
Asp372 peptide bond (Fig. 1b) and attributed to a putative serine 
protease, indicates that there may be a proteolytically susceptible 
exposed region between the two disulfide-bonded TS domains. It 
is not yet known whether this is physiologically relevant 
processing or whether it is an artifact that is unique to this 
expression system. The processing event releases the two 
COOH-terminal TS domains of punctin. Because proteolyti- 
cally derived fragments of many secreted proteins have distinc- 
tive functions, it will be interesting to investigate whether 
specific functions are associated with the ~40- and ~20-kDa 
fragments. 

A mass measurement of epitope-tagged recombinant punctin 
by MALDI-TOF MS and LC-ESMS revealed that purified punc- 
tin contained multiple species with masses higher than the predicted 
mass. Edman degradation indicated that all these species had 
the same amino terminus. Further MS analysis, glycoprotein 
staining, and culture in the presence of tunicamycin A confirm 
that punctin contains N-linked sugars but do not exclude the 
presence of O-linked sugar. Significant alteration of mobility 
was not seen after peptide N-glycosidase F treatment, suggest- 
ing that the N-linked carbohydrate may be resistant to com- 
plete enzymatic removal (26). 

Rotary shadowing is useful for demonstrating the physical 
conformation of a molecule as well as the existence of oligo- 
meric complexes (27-29). The data we have obtained for punc- 
tin are relevant to the ADAMTS proteases, lacunin, and papilin. They can 
be extrapolated to represent the structure of the ancillary 
domains of an ADAMTS enzyme and the "papilin cassette" (25) 
and provide the first insight into the conformation of these 
domain assemblies. Many ECM proteins exist as oligomers. 
This observation may also be the case with punctin, because 
rotary shadowing electron microscopy and gel electrophoresis 
occasionally suggested the presence of dimers and trimers. We 
anticipate that rotary shadowing will be useful for future stud- 
ies to investigate punctin oligomerization and interactions of 
punctin with putative ECM ligands. 

Punctin Is an ECM Glycoprotein That Binds to the Cell 
Substratum in a Spatially Specific Manner — Nontransformed 
cells in culture require a substratum for attachment, spread- 
ing, and migration. The substratum present on an unmodified 
plastic tissue culture surface is derived from the cells them- 
selves as well as from proteins in serum-supplemented culture 
medium (30-32). Quantitatively significant components of the 
cell substratum are laminin, fibronectin, vitronectin, collagen, 
tenascin, PG-M or versican (a chondroitin sulfate proteogly- 
can), perlecan (a heparan sulfate proteoglycan), hyaluronan, 
and tissue inhibitor of metalloproteases-3 (30-37). Punctin 
shares the subcellular distribution of molecules that do not 
generally co-localize with focal contacts (e.g. versican, hyaluro- 
nan, and tenascin) (31, 37). Because punctin is left behind in 
the ECM after cell detachment with EDTA, we conclude that 
when expressed in COS-1 cells, punctin binds a component of 
the ECM. Punctin in culture medium may reflect an excess 
beyond that which can bind to the substratum or indicate 
secretion from the free surface of the cell. Punctin does not bind 
to ECM between the cells, indicating that the punctin ligand is 
absent from these regions. Because similar staining was seen 
under serum-supplemented as well as under serum-free cul- 
ture conditions, it is probable that the ECM binding partner of 
punctin is a molecule produced by COS-1 cells but not one 
derived from fetal bovine serum. 

Significance of Punctin and the ADAMTS-like Family — Mol- 
ecules comprising ancillary domains of metalloproteases may 
be generated in biological systems by proteolytic processing or 
through alternative splicing of protease genes. Brooks et al. 
(38) found that the proteolytically generated hemopexin do- 
main of MMP-2 circulated in serum and bound to the integrin 
αvβ3. This MMP-2 fragment inhibited angiogenesis by prevent- 
ing membrane targeting of MMP-2 (38). So far, there are no 
known examples of ADAMTS-like proteins generated as splice 
variants of ADAMTS genes. The discovery of punctin demon- 
strates for the first time the existence of molecules closely 
resembling the ancillary domains of ADAMTS that are gener- 
ated as distinct gene products. 

The resemblance of ADAMTSL to ADAMTS suggests a func- 
tional relationship between these two groups of molecules. 
From studies on ADAMTS-1 (39) and ADAMTS-2 (40), it is 
known that the ancillary domains are required to bind and 
cleave substrates. ADAMTSL may regulate ADAMTS via 
one of several possible mechanisms. 
As a result of noncompetitive inhibition of ADAMTS-2, an 
inhibitory role has been shown for Drosophila papilin (25). 
Another possibility is that punctin may compete with ADAMTS 
for its substrates and protect the substrates from cleavage. The 
isolated MMP-2 hemopexin domain represents one such exam- 
ple. In a second example, a truncated nonenzymatic version of 
ADAM-17 was shown to have a dominant negative effect on the 
activation of tumor necrosis factor-α (41). An intriguing possi- 
bility is that the ADAMTS-like proteins may be enhancers of 
the ADAMTS proteases. For example, the procollagen C-pro- 
teinase enhancer protein (42) contains two domains homolo- 
gous to those found in the C-proteinase that are instrumental 
in binding to the carboxyl propeptide of procollagen I and 
enhancing its removal (43). Very little is currently known about 
the regulation of ADAMTS proteases following their activation, 
and it is possible that the ADAMTS-like proteins may provide 
a novel general principle of regulation. 